Killer Chips of the 21st Century



FOREWORD 
John C. Dvorak 



The AMD-K6® 3D processor is a miraculous chip. Not only because it incorporates a 
new design technology and in many ways modernizes the way chips will be designed 
in the future, but because it reestablishes a key second-source for microprocessors. 
This is critical to the growth of the industry. 

This excellent book and the associated CD-ROM tell you everything you ever wanted 
to know about the new AMD-K6 3D processor, and then some. 

The AMD-K6 3D processor operates at internal clock speeds of 300 MHz and above 
with a 100-MHz bus. The version of the AMD-K6 3D that is due out before the end of 
1998 will operate at 400 MHz with a 100-MHz bus and a backside on-chip full-speed 
L2 cache that runs at the same frequency as the internal processor clock. All of this 
tremendous performance takes advantage of the cost-effective Socket 7 and Super7™
platforms. This stretches the useful life of this technology and saves everyone money
in the process. Add the fact that the AMD-K6 3D 100-MHz motherboards are
implementing AGP technology and you have a real-world, cost-effective Pentium® II
competitor.

And then there's AMD's 3D technology. This puppy provides a huge improvement in 
video and multimedia performance. Once again AMD is leading the way in 
developing and producing an important enhancement to the x86 architecture. 

I'm convinced that if it wasn't for AMD and others, you'd be paying twice as much for
basic Pentium MMX™ processors, and the Pentium II would probably just be
entering the marketplace at astronomical price levels. Competition is good for the
consumer. The competition that AMD provides in the PC market makes Intel stay on
their toes instead of resting on their laurels.

The 3D and multimedia markets are in the midst of rapid growth. Last year's MMX
technology started to address this, but most see it as too little and too late. Fact is, 
there is plenty of room for improvement. MMX doesn't address floating-point 
computations, the heart and soul of 3D graphics and advanced multimedia 
applications. The AMD-K6 3D processor with 3D technology confronts this need by 
implementing a comprehensive set of advanced 3D instructions focused primarily on 
floating-point operations. These new instructions were developed based on input 
from the leading game developers, who told AMD exactly what they needed to make 
their software significantly faster. AMD listened to them and they have implemented 
a real improvement to the x86 instruction set. You will be able to see this 
improvement in the performance of the AMD-K6 3D processor. The improvement is 
not trivial. 

A single AMD-K6 3D instruction can perform two 32-bit floating-point operations in 
one processor clock cycle. And the AMD-K6 3D processor has two pipelines for 3D 
instructions, which means that the processor can execute two 3D instructions per 
clock for throughput of up to four floating-point operations per clock cycle. That 
equates to 1.2 GigaFlops on an AMD-K6 3D/300 processor compared to only 0.3 
GigaFlop capability on a Pentium II 300. By the way, you can actually see the internal
operation of the AMD-K6 3D processor with the simulator that is included on the 
CD-ROM that comes with this book. If you ever wondered how a RISC processor 
manages to execute x86 CISC code, the simulator will show you some of the 
under-the-hood techniques that make it all possible. Fascinating. 
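
The arithmetic behind those peak numbers is easy to check. The short Python sketch below multiplies operations per instruction, instructions per clock, and clock rate. The function name `peak_gflops` is made up for illustration, and reading the 0.3 figure as one floating-point operation per clock is an assumption, not something stated here.

```python
def peak_gflops(clock_mhz, ops_per_instruction, instructions_per_clock):
    """Peak floating-point rate in GFLOPS for a given clock and issue width."""
    flops_per_clock = ops_per_instruction * instructions_per_clock
    # MHz times operations per clock gives MFLOPS; divide by 1000 for GFLOPS.
    return clock_mhz * flops_per_clock / 1000

# Two 32-bit operations per 3D instruction, two 3D pipelines, 300 MHz.
k6_3d_300 = peak_gflops(300, ops_per_instruction=2, instructions_per_clock=2)

# Assumed comparison point: one floating-point operation per clock at 300 MHz.
one_op_per_clock_300 = peak_gflops(300, ops_per_instruction=1, instructions_per_clock=1)
```

These are peak rates only; real code rarely keeps both pipelines full on every clock.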

And things won't stop with the AMD-K6 3D processor. AMD is not a one-trick pony. 
The AMD-K7™ is expected to have first silicon soon and be in production in 1999.
AMD will continue to provide viable competition to Intel well into the 21st century,
and that competition will ensure that you will be able to buy leading-edge technology
at the lowest possible price. As far as I'm concerned this is the kind of competition
that keeps things interesting. This will be a fun battle to watch and in the meantime
we all benefit from these great new chips.

John C. Dvorak 

Berkeley, California 1998 
About This Book and CD-ROM



The Book 

At the time of publication, AMD had not made final naming decisions for the 
processor and the 3D technology. The names used in this book are the AMD code 
names for the processor and the 3D technology. 

Refer to Appendix B, "Code Optimization" on page 455 for details regarding the 
examples shown in the AMD-K6 3D simulator, especially the tables beginning with 
Table 87 on page 468. 

Refer to the AMD web site at www.amd.com for updates to material related to this 
book, including new scripts and updates for the AMD-K6 3D simulator. 

CD-ROM Contents 

The CD-ROM included with this book contains the following: 

■ AMD-K6 3D processor simulator 

■ All AMD processor technical documentation in Adobe Acrobat PDF format

■ Adobe Acrobat Reader for most platforms

Minimum System Requirements

The following minimum system is required in order to run the AMD-K6 3D processor 
simulator: 

■ A 133-MHz AMD-K6 processor

■ 16 Mbytes of memory 

■ Windows® 95 or Windows NT™ 4.0 operating system 

■ 256-color SVGA graphics mode video 

■ Video resolution of 800 by 600 

We recommend the following system for maximum enjoyment when using the
simulator:

■ A 166-MHz AMD-K6 processor or better 

■ 32 Mbytes of memory 

■ Windows 95 or Windows NT 4.0 operating system 
■ 65,536-color graphics mode or better

■ Video resolution of 1024 by 768

■ A sound card with speakers

Installation of the CD-ROM 

A setup program is provided for your convenience. Run install.exe from the root 
directory of the CD-ROM, and the installation program will step you through installing 
the simulator and the Acrobat reader, if you need it. 
AMD-K6 3D Processor

■ Advanced 6-Issue RISC86® Superscalar Microarchitecture 

♦ Ten parallel specialized execution units 

♦ Multiple sophisticated x86-to-RISC86 instruction decoders 

♦ Advanced two-level branch prediction 

♦ Speculative execution 

♦ Out-of-order execution 

♦ Register renaming and data forwarding 

♦ Issues up to six RISC86 instructions per clock 

■ Large On-Chip Split 64-Kbyte Level-One (L1) Cache

♦ 32-Kbyte instruction cache with additional 20-Kbytes of predecode cache 

♦ 32-Kbyte writeback dual-ported data cache 

♦ Two-way set associative 

♦ MESI protocol support 

■ 3D Technology 

♦ Additional instructions to improve 3D graphics and multimedia performance 

♦ Separate multiplier and ALU for superscalar instruction execution 

■ Compatible with both Super7™ and Socket 7

♦ Compatible with both the 66-MHz processor bus and 100-MHz processor bus 

♦ Accelerated Graphic Port (AGP) support 

■ High-Performance IEEE 754-Compatible and 854-Compatible Floating-Point Unit 

■ High-Performance Industry-Standard MMX™ Instructions

♦ Dual integer ALU for superscalar execution 

■ 321-Pin Ceramic Pin Grid Array (CPGA) Package 

■ Industry-Standard System Management Mode (SMM) 

■ IEEE 1149.1 Boundary Scan 

■ Full x86 Binary Software Compatibility 

1 AMD-K6 3D Processor

As the newest member of the AMD K86™ family of x86 processors, the innovative 
AMD-K6 3D processor brings industry-leading performance to PC systems running 
the extensive installed base of x86 software. Its Super7-compatible, 321-pin ceramic 
pin grid array (CPGA) package enables the processor to reduce time-to-market by 
leveraging today's cost-effective infrastructure to deliver a superior 
price/performance PC solution. 

The AMD-K6 3D processor is the first to incorporate 3D technology, a significant 
innovation to the x86 processor architecture that drives today's personal computers. 
With 3D technology, new, more powerful hardware and software applications enable a 
more entertaining and productive PC platform. Improvements include faster frame 
rates on high-resolution scenes, superior modeling of real world environments and 
physics, sharper and more detailed 3D imaging, smoother video playback, and near 
theater-quality audio. 

AMD has taken a leadership role in developing new instructions that enable exciting 
new levels of performance and realism. 3D technology was defined and implemented 
in collaboration with Microsoft®, application developers, and graphics vendors, and 
has received an enthusiastic reception. It is compatible with today's existing x86 
software and requires no operating system support, thereby enabling 3D applications 
to work with all existing operating systems. 

To provide state-of-the-art performance, the processor incorporates the innovative 
and efficient RISC86 microarchitecture, a large 64-Kbyte level-one cache (32-Kbyte 
dual-ported data cache, 32-Kbyte instruction cache with an additional 20-Kbytes of 
predecode cache), a powerful IEEE 754-compatible and 854-compatible floating-point 
execution unit, and a high-performance industry-standard multimedia execution unit 
for executing MMX instructions. The processor includes additional high-performance 
Single Instruction Multiple Data (SIMD) execution resources to support the 3D 
technology. These techniques have been combined to deliver industry leadership in
16-bit and 32-bit performance, providing exceptional performance for both 
Windows 95 and Windows NT software bases. 

The AMD-K6 3D processor's 6-issue RISC86 microarchitecture is a decoupled 
decode/execution superscalar design that implements state-of-the-art design 
techniques to achieve leading-edge performance. Advanced design techniques 
implemented in the AMD-K6 3D processor include multiple x86 instruction decode, 
single-clock internal RISC operations, ten execution units that support superscalar 
operation, out-of-order execution, data forwarding, speculative execution, and 
register renaming. In addition, the processor supports the industry's most advanced 
branch prediction logic by implementing an 8192-entry branch history table, the 
industry's only branch target cache, and a return address stack, which combine to 
deliver better than a 95% prediction rate. These design techniques enable the 
AMD-K6 3D processor to issue, execute, and retire multiple x86 instructions per 
clock, resulting in excellent scalable performance.

The AMD-K6 3D processor is fully x86 binary code compatible. AMD's extensive
experience through six generations of x86 processors has been carefully integrated
into the processor to ensure complete compatibility with Windows 9x, Windows 3.x,
Windows NT, DOS, OS/2, Unix, Solaris, NetWare®, Vines, and other leading x86
operating systems and applications. The AMD-K6 3D processor is Super7 and Socket
7-compatible. The Super7 initiative is an extension to today's popular and robust
Socket 7 platform. See "Super7 Platform Initiative" for more information.

AMD has designed, manufactured, and delivered over 50 million Microsoft 
Windows-compatible processors in the last five years alone. The AMD-K6 3D 
processor is the latest member in this long line of processors. With its combination of 
state-of-the-art features, industry-leading performance, high-performance 3D and 
multimedia engines, full x86 compatibility, and low-cost infrastructure, the AMD-K6 
3D is the superior choice for mainstream personal computers. 

Super7 Platform Initiative



AMD and its industry partners are investing in the future of Socket 7 with the new 
Super7 platform initiative. The goal of the initiative is to maintain the competitive 
vitality of the Socket 7 infrastructure through a series of planned enhancements, 
including the development of an industry-standard 100-MHz processor bus protocol.

In addition to the 100-MHz processor bus protocol, the Super7 initiative includes the
introduction of chipsets that support the AGP specification, and support for a 
backside L2 cache and frontside L3 cache. 

Super7 Enhancements 

The Super7 platform has the following enhancements: 

■ 100-MHz processor bus — The AMD-K6 3D processor supports a 100-MHz, 800
Mbyte/second frontside bus to provide a high-speed interface to Super7
platform-based chipsets. The 100-MHz interface to the frontside Level 2 (L2)
cache and main system memory speeds up access to the frontside cache and main
memory by 50 percent over the 66-MHz Socket 7 interface, a significant increase
in system performance equivalent to a jump of up to two processor speed grades.

■ Accelerated graphics port support — AGP improves the performance of mid-range 
PCs that have small amounts of video memory on the graphics card. The 
industry-standard AGP specification enables a 133-MHz graphics interface and 
will scale to even higher levels of performance. 

■ Support for backside L2 and frontside L3 cache — The Super7 platform has the 
'headroom' to support higher-performance AMD-K6 processors, with clock speeds 
scaling to 400 MHz and beyond. Future versions of the AMD-K6 processor will 
feature a full-speed, on-chip backside 256-Kbyte L2 cache designed to deliver new 
levels of system performance to mainstream desktop systems. These versions of 
the processor will also support an optional 100-MHz frontside L3 cache for even 
higher-performance system configurations. 
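
The 800 Mbyte/second and 50 percent figures quoted for the 100-MHz bus follow from simple arithmetic, sketched below in Python. The sketch assumes one full-width transfer per bus clock on the 64-bit data bus described in the internal architecture chapter, and it treats the nominal 66-MHz bus as 66 2/3 MHz. The function name is hypothetical.

```python
def bus_bandwidth_mbytes_per_s(bus_mhz, bus_width_bits=64):
    """Peak transfer rate, assuming one bus-wide transfer per bus clock."""
    return bus_mhz * bus_width_bits / 8

super7 = bus_bandwidth_mbytes_per_s(100)       # 100-MHz Super7 bus
socket7 = bus_bandwidth_mbytes_per_s(200 / 3)  # nominal 66-MHz bus, taken as 66 2/3 MHz

# Relative improvement of the faster bus, in percent.
gain_percent = (super7 / socket7 - 1) * 100
```

Sustained rates are lower than these peaks because of burst setup and memory latency, which this sketch ignores.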

Super7 Advantages

The Super7 platform has the following advantages:

■ Delivers performance and features competitive with alternate platforms at the 
same clock speed, and at a significantly lower cost 

■ Takes advantage of existing system designs for superior value 

■ Enables OEMs and resellers to take advantage of mature, high-volume 
infrastructure supported by multiple BIOS, chipset, graphics, and motherboard 
suppliers 

■ Reduces inventory and design costs with one motherboard for a wide range of 
products 

■ Builds on a huge installed base of more than 100 million motherboards 

■ Provides an easy upgrade path for future PC users, as well as a bridge to legacy
users

By taking advantage of the low-cost, mature Socket 7 infrastructure, the Super7
platform will continue to provide superior value and leading-edge performance for
mainstream desktop systems.
2

Internal Architecture 



Introduction 



The AMD-K6 3D processor implements advanced design 
techniques known as the RISC86 microarchitecture. The RISC86 
microarchitecture is a decoupled decode/execution design 
approach that yields superior sixth-generation performance for 
x86-based software. This chapter describes the techniques used 
and the functional elements of the RISC86 microarchitecture. 

AMD-K6 3D Processor Microarchitecture Overview



When discussing processor design, it is important to understand 
the terms architecture, microarchitecture, and design
implementation. The term architecture refers to the instruction 
set and features that are visible to software programs running on 
the processor. The architecture determines what software the 
processor can run. The architecture of the AMD-K6 3D processor 
is the industry-standard x86 instruction set. 

The term microarchitecture refers to the design techniques used 
in the processor to reach the target cost, performance, and 
functionality goals. The AMD-K6 family of processors is based
on a sophisticated RISC core known as the enhanced RISC86
microarchitecture. The enhanced RISC86 microarchitecture is
an advanced, second-order decoupled decode/execution design
approach that enables industry-leading performance for 
x86-based software. 

The term design implementation refers to the actual logic and 
circuit designs from which the processor is created according to 
the microarchitecture specifications. 

Enhanced RISC86 Microarchitecture 

The enhanced RISC86 microarchitecture defines the 
characteristics of the AMD-K6 family. The innovative RISC86 
microarchitecture approach implements the x86 instruction set 
by internally translating x86 instructions into RISC86 
operations. These RISC86 operations were specially designed to 
include direct support for the x86 instruction set while observing 
the RISC performance principles of fixed-length encoding, 
regularized instruction fields, and a large register set. The 
enhanced RISC86 microarchitecture used in the AMD-K6 3D 
processor enables higher processor core performance and 
promotes straightforward extensions, such as those added in the 
current AMD-K6 3D processor and those planned for the future. 
Instead of directly executing complex x86 instructions, which 
have lengths of 1 to 15 bytes, the AMD-K6 3D processor executes 
the simpler and easier fixed-length RISC86 opcodes, while 
maintaining the instruction coding efficiencies found in x86 
programs. 

The processor contains parallel decoders, a centralized RISC86 
operation scheduler, and ten execution units that support 
superscalar operation — multiple decode, execution, and 
retirement — of x86 instructions. These elements are packed into 
an aggressive and highly efficient six-stage pipeline. 

As shown in Figure 1 on page 7, the high-performance, 
out-of-order execution engine of the AMD-K6 3D processor is 
mated to a split level-one 64-Kbyte writeback cache with 32 
Kbytes of instruction cache and 32 Kbytes of data cache. The 
instruction cache feeds the decoders and, in turn, the decoders 
feed the scheduler. The Instruction Control Unit (ICU) issues and
retires RISC86 operations contained in the scheduler. The system bus
interface is an industry-standard 64-bit Super7 and Socket 7
demultiplexed bus.



Processor Block Diagram

Figure 1. AMD-K6 3D Processor Block Diagram

The AMD-K6 3D processor combines the latest in processor 
microarchitecture to provide the highest x86 performance for 
today's personal computers. The processor offers true 
sixth-generation performance and full x86 binary software 
compatibility. 

Decoders

Decoding of the x86 instructions begins when the on-chip
instruction cache is filled. Predecode logic determines the
length of an x86 instruction on a byte-by-byte basis. This 
predecode information is stored, along with the x86 instructions, 
in the instruction cache, to be used later by the decoders. The 
decoders translate on-the-fly, with no additional latency, up to 
two x86 instructions per clock into RISC86 operations. 

Note: In this chapter, "clock" refers to a processor clock. 

The AMD-K6 3D processor categorizes x86 instructions into 
three types of decodes — short, long, and vector. The decoders 
process either two short, one long, or one vector decode at a time. 
The three types of decodes have the following characteristics:

■ Short decodes — x86 instructions less than or equal to seven 
bytes in length 

■ Long decodes — x86 instructions less than or equal to 11 bytes 
in length 

■ Vector decodes — complex x86 instructions 

Short and long decodes are processed completely within the 
decoders. Vector decodes are started by the decoders and then 
completed by fetched sequences from an on-chip ROM. After 
decoding, the RISC86 operations are delivered to the scheduler
for dispatching to the execution units.
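
The three decode categories can be sketched as a small Python classifier. This is only an illustration under stated assumptions: the `is_complex` flag is a hypothetical stand-in for whatever the hardware uses to recognize complex instructions, and routing instructions longer than 11 bytes to the vector path is an assumption. As described later in this chapter, the real decoders also take into account how commonly an instruction is used, which this sketch ignores.

```python
SHORT_MAX_BYTES = 7    # short decodes: up to seven bytes
LONG_MAX_BYTES = 11    # long decodes: up to 11 bytes
X86_MAX_BYTES = 15     # longest possible x86 instruction

def decode_type(length_bytes, is_complex=False):
    """Classify an x86 instruction as a short, long, or vector decode."""
    if not 1 <= length_bytes <= X86_MAX_BYTES:
        raise ValueError("x86 instructions are 1 to 15 bytes long")
    # Assumption: complex instructions and anything too long for the
    # long decoder take the vector (ROM-assisted) path.
    if is_complex or length_bytes > LONG_MAX_BYTES:
        return "vector"
    if length_bytes <= SHORT_MAX_BYTES:
        return "short"
    return "long"
```

For example, `decode_type(3)` gives `"short"`, `decode_type(9)` gives `"long"`, and `decode_type(3, is_complex=True)` gives `"vector"`.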

Scheduler/Instruction Control Unit

The centralized scheduler or buffer is managed by the
Instruction Control Unit (ICU). The ICU buffers and manages up
to 24 RISC86 operations at a time. This equals from 6 to 12 x86
instructions. This buffer size (24) is perfectly matched to the 
processor's six-stage RISC86 pipeline and 4 RISC86-operations 
decode rate. The scheduler accepts as many as four RISC86 
operations at a time from the decoders and retires up to 4 RISC86 
operations per clock cycle. The ICU is capable of simultaneously 
issuing up to six RISC86 operations at a time to the execution 
units. This consists of the following types of operations: 

■ Memory load operation 

■ Memory store operation 

■ Complex integer, MMX or 3D register operation 

■ Simple integer, MMX or 3D register operation 

■ Floating-point register operation 

■ Branch condition evaluation 

Registers

The scheduler uses 48 physical registers that are contained
within the RISC86 microarchitecture when managing the 24
RISC86 operations. The 48 physical registers are located in a 
general register file and are grouped as 24 general registers, plus 
24 renaming registers. The 24 general registers consist of 16 
scratch registers and eight registers that correspond to the x86 
general purpose registers— EAX, EBX, ECX, EDX, EBP, ESP, 
ESI and EDI. 



Branch Logic

The AMD-K6 3D processor is designed with highly sophisticated
dynamic branch logic consisting of the following:

■ Branch history/Prediction table 

■ Branch target cache 

■ Return address stack 

The processor implements a two-level branch prediction scheme 
based on an 8192-entry branch history table. The branch history 
table stores prediction information that is used for predicting 
conditional branches. Because the branch history table does not 
store predicted target addresses, special address ALUs calculate 
target addresses on-the-fly during instruction decode. The 
branch target cache augments predicted branch performance by 
avoiding a one clock cache-fetch penalty. This specialized target 
cache does this by supplying the first 16 bytes of target 
instructions to the decoders when branches are predicted. The 
return address stack is a unique device specifically designed for 
optimizing CALL and RETURN pairs. In summary, the AMD-K6 
3D processor uses dynamic branch logic to minimize delays due 
to the branch instructions that are common in x86 software. 
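
The idea behind a return address stack can be illustrated with a short Python model. This is a generic sketch of the technique, not a description of the processor's actual circuit: the class name, the default depth of 16 entries, and the policy of discarding the oldest entry on overflow are all assumptions made for the example.

```python
class ReturnAddressStack:
    """Generic model: push on CALL, pop on RETURN to predict the target."""

    def __init__(self, depth=16):  # depth is a hypothetical value
        self.depth = depth
        self._entries = []

    def on_call(self, return_address):
        # Assumed overflow policy: discard the oldest entry when full.
        if len(self._entries) == self.depth:
            self._entries.pop(0)
        self._entries.append(return_address)

    def predict_return(self):
        # The most recent CALL supplies the predicted RETURN target;
        # None means no prediction is available.
        return self._entries.pop() if self._entries else None
```

Because calls and returns nest, the most recently pushed address is almost always the right prediction for the next RETURN, which is why a small stack works well for CALL and RETURN pairs.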

3D Technology

AMD has taken a lead role in improving the multimedia and 3D
capabilities of the x86 processor family with the introduction of
3D technology, which uses a packed, single-precision,
floating-point data format and Single Instruction Multiple Data
(SIMD) operations based on the MMX model. For more
information, see Chapter 4, "3D Technology" on page 81.

Cache, Instruction Prefetch, and Predecode Bits

The writeback level-one cache on the AMD-K6 3D processor is 
organized as a separate 32-Kbyte instruction cache and a 
32-Kbyte data cache with two-way set associativity. The cache 
line size is 32 bytes and lines are prefetched from main memory 
using an efficient pipelined burst transaction. As the instruction 
cache is filled, each instruction byte is analyzed for instruction 
boundaries using predecoding logic. Predecoding annotates 
information (5 bits per byte) to each instruction byte that later 
enables the decoders to efficiently decode multiple instructions 
simultaneously. 

Cache

The processor cache design takes advantage of a sectored
organization (see Figure 2 on page 11). Each sector consists of 64
bytes configured as two 32-byte cache lines. The two cache lines 
of a sector share a common tag but have separate pairs of MESI 
(Modified, Exclusive, Shared, Invalid) bits that track the state of 
each cache line. 

Two forms of cache misses and associated cache fills can take 
place — a sector replacement and a cache line replacement. In 
the case of a sector replacement, the miss is due to a tag 
mismatch, in which case the required cache line is filled from 
external memory, and the cache line within the sector that was 
not required is marked as invalid. In the case of a cache line 
replacement, the address matches the tag, but the requested 
cache line is marked as invalid. The required cache line is filled 
from external memory, and the cache line within the sector that 
is not required remains in the same cache state. 
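
The two miss types can be modeled in a few lines of Python. The sketch below is a simplified illustration that looks at a single sector rather than both ways of a set. The address split is an assumption derived from the stated sizes (32 Kbytes, two ways, 64-byte sectors, 32-byte lines) using conventional indexing, and the function names are hypothetical.

```python
LINE_BYTES = 32
LINES_PER_SECTOR = 2
SECTOR_BYTES = LINE_BYTES * LINES_PER_SECTOR       # 64 bytes
CACHE_BYTES = 32 * 1024
WAYS = 2
SETS = CACHE_BYTES // (WAYS * SECTOR_BYTES)        # 256 sets

def split_address(addr):
    """Return (tag, set index, line within sector) for a byte address."""
    line = (addr // LINE_BYTES) % LINES_PER_SECTOR
    set_index = (addr // SECTOR_BYTES) % SETS
    tag = addr // (SECTOR_BYTES * SETS)
    return tag, set_index, line

def classify_access(sector_tag, line_states, addr):
    """Classify an access against one sector.

    sector_tag is the tag held by the sector; line_states holds one MESI
    letter ("M", "E", "S", or "I") for each of the sector's two lines.
    """
    tag, _, line = split_address(addr)
    if sector_tag != tag:
        return "sector replacement"       # tag mismatch
    if line_states[line] == "I":
        return "cache line replacement"   # tag matches, line invalid
    return "hit"
```

Sharing one tag between two lines halves the tag storage, while the separate MESI bits still let each 32-byte line be filled or invalidated on its own.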

Prefetching

The processor performs cache prefetching for sector
replacements only, as opposed to cache line replacements. This
cache prefetching results in the filling of the required cache line
first, and a prefetch of the second cache line. Furthermore, the
prefetch of the cache line that is not required is initiated only in
the forward direction — that is, only if the requested cache line is 
the first cache line within the sector. From the perspective of the 
external bus, the two cache-line fills typically appear as two 
32-byte burst read cycles occurring back-to-back or, if allowed, as 
pipelined cycles. 
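
The prefetch rule reduces to a small decision, sketched here in Python. The function name and the string labels for the miss types are hypothetical; the logic follows the rule as stated: prefetch only on a sector replacement, and only when the requested line is the first line of the sector.

```python
def lines_to_fill(miss_kind, requested_line):
    """Return the line indices (0 or 1 within the sector) in fill order."""
    if requested_line not in (0, 1):
        raise ValueError("a sector holds exactly two cache lines")
    fills = [requested_line]                 # the required line is filled first
    if miss_kind == "sector replacement" and requested_line == 0:
        fills.append(1)                      # forward prefetch of the second line
    return fills
```

A sector replacement that requests line 0 yields `[0, 1]`, which on the external bus would appear as the two back-to-back burst reads described above. Every other case fills only the requested line.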



The 3D technology includes a new instruction called PREFETCH 
that allows a cache line to be prefetched into the data cache. The 
PREFETCH instruction format is defined in Table 15, "3D 
Instructions," on page 79. For more detailed information, see 
Chapter 4, "3D Technology" on page 81. 

Predecode Bits

Decoding x86 instructions is particularly difficult because the
instructions are variable-length and can be from 1 to 15 bytes
long. Predecode logic supplies the five predecode bits that are 
associated with each instruction byte. The predecode bits 
indicate the number of bytes to the start of the next x86 
instruction. The predecode bits are stored in an extended 
instruction cache alongside each x86 instruction byte as shown 
in Figure 2. The predecode bits are passed with the instruction 
bytes to the decoders where they assist with parallel x86 
instruction decoding. 
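
A rough Python model shows what the predecode information might look like for a run of instruction bytes. The exact encoding is an assumption: each byte is simply tagged with the distance in bytes to the start of the next instruction, which fits in five bits for instructions of up to 15 bytes. The second function checks that five bits for every byte of a 32-Kbyte instruction cache comes to the 20 Kbytes of predecode cache listed among the processor features. Both function names are hypothetical.

```python
PREDECODE_BITS_PER_BYTE = 5

def predecode(instruction_lengths):
    """Tag each instruction byte with the distance to the next instruction."""
    tags = []
    for length in instruction_lengths:
        if not 1 <= length <= 15:
            raise ValueError("x86 instructions are 1 to 15 bytes long")
        # First byte gets `length`, last byte gets 1.
        tags.extend(range(length, 0, -1))
    return tags

def predecode_storage_bytes(icache_bytes):
    """Storage needed for the predecode bits of an instruction cache."""
    return icache_bytes * PREDECODE_BITS_PER_BYTE // 8
```

With tags like these available, a decoder looking at any byte can locate the next instruction boundary without first decoding the current instruction, which is what makes decoding several instructions in parallel practical.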



Instruction Fetch and Decode 



Instruction Fetch

The processor can fetch up to 16 bytes per clock out of the
instruction cache or branch target cache. The fetched
information is placed into a 16-byte instruction buffer that feeds
directly into the decoders (see Figure 3). Fetching can occur 
along a single execution stream with up to seven outstanding 
branches taken. 

The instruction fetch logic is capable of retrieving any 16 
contiguous bytes of information within a 32-byte boundary. 
There is no additional penalty when the 16 bytes of instructions 
cross a cache-line boundary. The instruction bytes are loaded 
into the instruction buffer as they are consumed by the decoders. 
Although instructions can be consumed with byte granularity, 
the instruction buffer is managed on a memory-aligned word 
(two bytes) organization. Therefore, instructions are loaded and 
replaced with word granularity. When a control transfer 
occurs, such as a JMP instruction, the entire instruction
buffer is flushed and reloaded with a new set of 16 instruction
bytes.

Figure 3. The Instruction Buffer



Instruction Decode



The AMD-K6 3D processor decode logic is designed to decode 
multiple x86 instructions per clock (see Figure 4). The decode 
logic accepts x86 instruction bytes and their predecode bits from 
the instruction buffer, locates the actual instruction boundaries, 
and generates RISC86 operations from these x86 instructions. 

RISC86 operations are fixed-length internal instructions. Most 
RISC86 operations execute in a single clock. RISC86 operations 
are combined to perform every function of the x86 instruction 
set. Some x86 instructions are decoded into as few as zero
RISC86 operations (for instance, a NOP) or one RISC86
operation (for instance, a register-to-register add). More complex x86
instructions are decoded into several RISC86 operations.

Figure 4. Processor Decode Logic



The AMD-K6 3D processor uses a combination of decoders to 
convert x86 instructions into RISC86 operations. The hardware 
consists of three sets of decoders — two parallel short decoders, 
one long decoder, and one vector decoder. The two parallel short 
decoders translate the most commonly-used x86 instructions 
(moves, shifts, branches, ALU, MMX, 3D, FPU) into zero, one, or 
two RISC86 operations each. The short decoders only operate on 
x86 instructions that are up to seven bytes long. In addition, they 
are designed to decode up to two x86 instructions per clock. The 
commonly-used x86 instructions that are greater than seven 
bytes but not more than 11 bytes long, and semi-commonly-used 
x86 instructions that are up to seven bytes long are handled by 
the long decoder. 

The long decoder only performs one decode per clock and 
generates up to four RISC86 operations. All other translations 
(complex instructions, serializing conditions, interrupts and 
exceptions, etc.) are handled by a combination of the vector 
decoder and RISC86 operation sequences fetched from an 
on-chip ROM. For complex operations, the vector decoder logic 
provides the first set of RISC86 operations and a vector (initial 
ROM address) to a sequence of further RISC86 operations. The 
same types of RISC86 operations are fetched from the ROM as 
those that are generated by the hardware decoders. 

Note: Although all three sets of decoders are simultaneously fed a 
copy of the instruction buffer contents, only one of the three 
types of decoders is used during any one decode clock. 

The decoders or the on-chip RISC86 ROM always generate a 
group of four RISC86 operations. For decodes that cannot fill the 
entire group with four RISC86 operations, RISC86 NOP 
operations are placed in the empty locations of the grouping. For 
example, a long-decoded x86 instruction that converts to only 
three RISC86 operations is padded with a single RISC86 NOP 
operation and then passed to the scheduler. Up to six groups or 
24 RISC86 operations can be placed in the scheduler at a time. 
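
The padding rule can be shown with a minimal Python sketch. The operation names are placeholders rather than real RISC86 mnemonics, and the function name is hypothetical.

```python
GROUP_SIZE = 4          # RISC86 operations per group
SCHEDULER_GROUPS = 6    # groups the scheduler can hold at a time

def pad_group(ops):
    """Pad a decoded group to four operations with RISC86 NOPs."""
    if len(ops) > GROUP_SIZE:
        raise ValueError("a group holds at most four RISC86 operations")
    return list(ops) + ["NOP"] * (GROUP_SIZE - len(ops))

# A long decode that produced three operations gains one NOP.
padded = pad_group(["LOAD", "ADD", "STORE"])

# Six groups of four give the scheduler's 24-operation capacity.
scheduler_capacity = SCHEDULER_GROUPS * GROUP_SIZE
```

Fixed-size groups keep the scheduler's bookkeeping simple, at the cost of occasionally spending buffer slots on NOPs.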




All of the common, and a few of the uncommon, floating-point 
instructions (also known as ESC instructions) are hardware 
decoded as short decodes. This decode generates a RISC86 
floating-point operation and, optionally, an associated 
floating-point load or store operation. Floating-point or ESC 
instruction decode is only allowed in the first short decoder, but 
non-ESC instructions can be decoded simultaneously by the
second short decoder along with an ESC instruction decode in 
the first short decoder. 

All of the MMX and 3D instructions, with the exception of the 
EMMS, FEMMS, and PREFETCH instructions, are hardware 
decoded as short decodes. The MMX instruction decode 
generates a RISC86 MMX operation and, optionally, an 
associated MMX load or store operation. A 3D instruction
decode generates a RISC86 3D operation and, optionally, an 
associated load or store operation. MMX and 3D instructions can 
be decoded in either or both of the short decoders. 



Centralized Scheduler 



The scheduler is the heart of the AMD-K6 3D processor (see 
Figure 5 on page 17). It contains the logic necessary to manage 
out-of-order execution, data forwarding, register renaming, 
simultaneous issue and retirement of multiple RISC86 
operations, and speculative execution. The scheduler's buffer 
can hold up to 24 RISC86 operations. This equates to a maximum 
of 12 x86 instructions. The scheduler can issue RISC86 
operations from any of the 24 locations in the buffer. When 
possible, the scheduler can simultaneously issue a RISC86 
operation to any available execution unit (store, load, branch,
register X integer/MMX, register Y integer/MMX, or
floating-point). In total, the scheduler can issue up to six and 
retire up to four RISC86 operations per clock. 

The main advantage of the scheduler and its operation buffer is 
the ability to examine an x86 instruction window equal to 12 x86 
instructions at one time. This advantage is due to the fact that 
the scheduler operates on the RISC86 operations in parallel and 
allows the processor to perform dynamic on-the-fly instruction 
code scheduling for optimized execution. Although the 
scheduler can issue RISC86 operations for out-of-order 
execution, it always retires x86 instructions in order. 
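The grouping arithmetic described above can be illustrated with a small sketch. It assumes only what the text states: each decoded x86 instruction fills a group of four RISC86 operations, short groups are padded with NOPs, and the scheduler holds six such groups. The operation names used here are hypothetical placeholders, not real RISC86 mnemonics.

```python
GROUP_SIZE = 4        # RISC86 operations per group (from the text)
SCHEDULER_GROUPS = 6  # groups held by the scheduler buffer (from the text)


def pad_group(ops):
    """Pad the RISC86 operations of one decode to a full group with NOPs."""
    if not 0 < len(ops) <= GROUP_SIZE:
        raise ValueError("a group holds between 1 and 4 operations")
    return list(ops) + ["NOP"] * (GROUP_SIZE - len(ops))


def buffer_capacity():
    """Total RISC86 operations the scheduler buffer can hold."""
    return GROUP_SIZE * SCHEDULER_GROUPS


# Hypothetical operation names, used only to show the padding.
print(pad_group(["LOAD", "ADD", "STORE"]))
print(buffer_capacity())
```

A three-operation decode therefore occupies a full four-entry group, which is why the usable window depends on how densely each group is filled.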
Figure 5. Processor Scheduler

Execution Units 

The AMD-K6 3D processor contains ten parallel execution
units: store, load, integer X ALU, integer Y ALU, MMX ALU
(X), MMX ALU (Y), MMX/3D multiplier, 3D ALU, floating-point,
and branch condition. Each unit is independent and capable of 
handling the RISC86 operations. Table 1 on page 19 details the 
execution units, functions performed within these units, 
operation latency, and operation throughput. For more 
information, see "Execution Units and Dependency Latencies" 
on page 458 in Appendix B. 

The store and load execution units are two-stage pipelined 
designs. The store unit performs data writes and register 
calculation for LEA/PUSH. Data memory and register writes 
from stores are available after one clock. Store operations are 
held in a store queue prior to execution. From there, they 
execute in order. The load unit performs data memory reads. 
Data is available from the load unit after two clocks. For more
information, see "Load Unit" on page 462 and "Store Unit" on
page 463, both in Appendix B.

The Integer X execution unit can operate on all ALU operations, 
multiplies, divides (signed and unsigned), shifts, and rotates. 

The Integer Y execution unit can operate on the basic word and
doubleword ALU operations: ADD, AND, CMP, OR, SUB, XOR,
zero-extend and sign-extend operands.
Table 1. Execution Latency and Throughput of Execution Units

Functional Unit   Function                                 Latency   Throughput
---------------   --------------------------------------   -------   ----------
Store             LEA/PUSH, Address (Pipelined)            1
                  Memory Store (Pipelined)                 1
Load              Memory Loads (Pipelined)                 2
Integer X         Integer ALU                              1
                  Integer Multiply                         2-3       2-3
                  Integer Shift                            1
Integer MMX       MMX ALU                                  1
                  MMX Shifts, Packs, Unpack                1
                  MMX Multiply                             2
Integer Y         Basic ALU (16-bit and 32-bit operands)   1
Branch            Resolves Branch Conditions               1
FPU               FADD, FSUB, FMUL                         2
3D                3D ALU                                   2
                  3D Multiply                              2
                  3D Convert                               2

Register X and Y Pipelines

The MMX units and the 3D units share pipeline control with the
Integer X and Integer Y units.

The register X and Y functional units are attached to the issue 
bus for the register X execution pipeline or the issue bus for the 
register Y execution pipeline or both. Figure 6 shows the details 
of the X and Y register pipelines. Each register pipeline has 
dedicated resources that consist of an integer execution unit and 
an MMX ALU execution unit, therefore allowing superscalar 
operation on integer and MMX instructions. In addition, both 
the X and Y issue buses are connected to the 3D ALU, the 
3D/MMX multiplier and MMX shifter, which allows the 
appropriate RISC86 operation to be issued through either bus. 
For more information, see Figure 50 on page 91 in Chapter 4 and 
"Register Execution Units" on page 460 in Appendix B. 

The branch condition unit is separate from the branch prediction 
logic in that it resolves conditional branches such as JCC and 
LOOP after the branch condition has been evaluated. For more 
information, see "Branch Condition Unit" on page 464 in 
Appendix B. 



Figure 6. Register X and Y Functional Units




Branch-Prediction Logic 

Sophisticated branch logic that can minimize or hide the impact 
of changes in program flow is designed into the AMD-K6 3D 
processor. Branches in x86 code fit into two categories — 
unconditional branches, which always change program flow (that 
is, the branches are always taken) and conditional branches, 
which may or may not divert program flow (that is, the branches 
are taken or not-taken). When a conditional branch is not taken, 
the processor simply continues decoding and executing the next 
instructions in memory. 

Typical applications consist of up to 10% unconditional branches
and another 10% to 20% conditional branches. The processor
branch logic has been designed to handle this type of program 
behavior and its negative effects on instruction execution, such 
as stalls due to delayed instruction fetching and the draining of 
the processor pipeline. The branch logic contains an 8192-entry 
branch history table, a 16-entry by 16-byte branch target cache, a 
16-entry return address stack, and a branch execution unit. 

Branch History Table

The AMD-K6 3D processor handles unconditional branches
without any penalty by redirecting instruction fetching to the 
target address of the unconditional branch. However, 
conditional branches require the use of the dynamic 
branch-prediction mechanism built into the processor. A 
two-level adaptive history algorithm is implemented in an 
8192-entry branch history table. This table stores executed 
branch information, predicts individual branches, and predicts 
the behavior of groups of branches. To accommodate the large 
branch history table, the processor does not store predicted 
target addresses. Instead, the branch target addresses are 
calculated on-the-fly using ALUs during the decode stage. The 
adders calculate all possible target addresses before the 
instructions are fully decoded and the processor chooses which 
addresses are valid. 
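To make the idea of a two-level adaptive predictor concrete, here is a minimal sketch. The text specifies only the table size (8192 entries) and that the algorithm is two-level and adaptive; the history length, the XOR index hash, and the 2-bit saturating counters below are assumptions chosen for illustration and are not claimed to match the processor's actual design.

```python
class TwoLevelPredictor:
    """Global-history predictor with 2-bit saturating counters (a sketch)."""

    def __init__(self, entries=8192, history_bits=9):
        self.entries = entries                      # table size from the text
        self.history_mask = (1 << history_bits) - 1  # assumed history length
        self.history = 0                 # level 1: recent branch outcomes
        self.counters = [1] * entries    # level 2: start weakly not-taken

    def _index(self, address):
        # Assumed hash: XOR the branch address with the global history.
        return (address ^ self.history) % self.entries

    def predict(self, address):
        """Return True when the branch is predicted taken."""
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """Train the counter, then shift the outcome into the history."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = TwoLevelPredictor()
for _ in range(20):                      # a loop branch that is always taken
    predictor.update(0x1000, True)
print(predictor.predict(0x1000))
```

Because the table stores only direction counters and no target addresses, all 8192 entries can be spent on history, which is the trade-off the paragraph above describes.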

Branch Target Cache

To avoid a one-clock cache-fetch penalty when a branch is
predicted taken, a built-in branch target cache supplies the first 
16 bytes of instructions directly to the instruction buffer 
(assuming the target address hits this cache). (See Figure 3 on 
page 12.) The branch target cache is organized as 16 entries of 16 
bytes. In total, the branch prediction logic achieves branch 
prediction rates greater than 95%. 
Return Address Stack

The return address stack is a special device designed to optimize
CALL and RET pairs. Software is typically compiled with
subroutines that are frequently called from various places in a
program. This is usually done to save space. Entry into the
subroutine occurs with the execution of a CALL instruction. At
that time, the processor pushes the address of the next
instruction in memory following the CALL instruction onto the
stack (allocated space in memory). When the processor
encounters a RET instruction (within or at the end of the
subroutine), the branch logic pops the address from the stack
and begins fetching from that location. To avoid the latency of
main memory accesses during CALL and RET operations, the
return address stack caches the pushed addresses.

Branch Execution Unit

The branch execution unit enables efficient speculative
execution. This unit gives the processor the ability to execute
instructions beyond conditional branches before knowing
whether the branch prediction was correct. The processor does
not permanently update the x86 registers or memory locations
until all speculatively executed conditional branch instructions
are resolved. When a prediction is incorrect, the processor backs
out to the point of the mispredicted branch instruction and
restores all registers. The AMD-K6 3D processor can support up
to seven outstanding branches.
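The behavior of the return address stack described above can be modeled in a few lines. The 16-entry depth comes from the text; the overflow policy (discarding the oldest entry) and the method names are assumptions made for this sketch.

```python
class ReturnAddressStack:
    """Small hardware-style stack that caches CALL return addresses."""

    def __init__(self, depth=16):
        self.depth = depth      # 16 entries, per the text
        self.entries = []

    def on_call(self, return_address):
        """Record the address of the instruction that follows a CALL."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumed policy: drop the oldest entry
        self.entries.append(return_address)

    def on_ret(self):
        """Predict the RET target, or None when nothing is cached."""
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.on_call(0x1005)   # outer CALL
ras.on_call(0x2010)   # nested CALL
print(hex(ras.on_ret()), hex(ras.on_ret()))
```

Nested calls return in last-in, first-out order, so a stack is the natural structure; a miss (None here) simply means the processor must wait for the address from memory.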
Software Environment



This chapter provides a general overview of the AMD-K6 3D 
processor's x86 software environment and briefly describes the 
data types, registers, operating modes, interrupts, and 
instructions supported by the AMD-K6 3D architecture and 
design implementation. 

Registers 



The AMD-K6 3D processor contains all the registers defined by 
the x86 architecture, including general-purpose, segment, 
floating-point, MMX (3D), EFLAGS, control, task, debug, test, 
and descriptor/memory-management registers. In addition, this
chapter provides information on the processor model-specific
registers (MSRs).

Note: Areas of the register designated as Reserved should not be 
modified by software. 

General-Purpose Registers 

The eight 32-bit x86 general-purpose registers are used to hold 
integer data or memory pointers used by instructions. Table 2 
on page 24 contains a list of the general-purpose registers and 
the functions for which they are used.
Table 2. General-Purpose Registers

Register   Function
--------   ------------------------------------------------------------
EAX        Commonly used as an accumulator
EBX        Commonly used as a pointer
ECX        Commonly used for counting in loop operations
EDX        Commonly used to hold I/O information and to pass parameters
EDI        Commonly used as a destination pointer by the ES segment
ESI        Commonly used as a source pointer by the DS segment
ESP        Used to point to the stack segment
EBP        Used to point to data within the stack segment

In order to support byte and word operations, EAX, EBX, ECX, 
and EDX can also be used as 8-bit and 16-bit registers. The 
shorter registers are overlaid on the longer ones. For example, 
the name of the 16-bit version of EAX is AX (low 16 bits of 
EAX) and the 8-bit names for AX are AH (high order bits) and 
AL (low order bits). The same naming convention applies to 
EBX, ECX, and EDX. EDI, ESI, ESP, and EBP can be used as 
smaller 16-bit registers called DI, SI, SP, and BP respectively, 
but these registers do not have 8-bit versions. Figure 7 shows the 
EAX register with its name components, and Table 3 on page 25 
lists the doubleword (32-bit) general-purpose registers and 
their corresponding word (16-bit) and byte (8-bit) versions. 



EAX spans bits 31-0; AX spans bits 15-0; AH spans bits 15-8;
AL spans bits 7-0.

Figure 7. EAX Register with 16-Bit and 8-Bit Name Components
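The overlay of the shorter register names on EAX can be expressed with shifts and masks. The sketch below simply models the register as an integer; the function names are hypothetical and chosen for illustration.

```python
def split_eax(eax):
    """Return the 32-, 16-, and 8-bit views that overlay EAX."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF             # low 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,   # high-order byte of AX
        "AL": ax & 0xFF,          # low-order byte of AX
    }


def write_al(eax, value):
    """Writing AL changes only bits 7-0 of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


print({name: hex(v) for name, v in split_eax(0x12345678).items()})
print(hex(write_al(0x12345678, 0xFF)))
```

The same masks apply to EBX, ECX, and EDX; for EDI, ESI, ESP, and EBP only the 16-bit view exists.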
Table 3. General-Purpose Register Dword, Word, and Byte Names

32-Bit Name   16-Bit Name   8-Bit Name          8-Bit Name
(Dword)       (Word)        (High-order Bits)   (Low-order Bits)
-----------   -----------   -----------------   ----------------
EAX           AX            AH                  AL
EBX           BX            BH                  BL
ECX           CX            CH                  CL
EDX           DX            DH                  DL
EDI           DI
ESI           SI
ESP           SP
EBP           BP

Integer Data Types

Four types of data are used in general-purpose registers: byte,
word, doubleword, and quadword integers. Figure 8 shows the
format of the integer data registers.

Data Type            Bits   Precision
------------------   ----   ---------
Byte Integer         7-0    8 Bits
Word Integer         15-0   16 Bits
Doubleword Integer   31-0   32 Bits
Quadword Integer     63-0   64 Bits

Figure 8. Integer Data Registers
Segment Registers

The six 16-bit segment registers are used as pointers to areas
(segments) of memory. Table 4 lists the segment registers and
their functions. Figure 9 shows the format for all six segment
registers.

Table 4. Segment Registers

Segment Register   Segment Register Function
----------------   --------------------------------------------
CS                 Code segment, where instructions are located
DS                 Data segment, where data is located
ES                 Data segment, where data is located
FS                 Data segment, where data is located
GS                 Data segment, where data is located
SS                 Stack segment

Each segment register occupies bits 15-0.

Figure 9. Segment Register

Segment Usage 

The operating system determines the type of memory model 
that is implemented. The segment register usage is determined 
by the operating system's memory model. In a Real mode 
memory model the segment register points to the base address 
in memory. In a Protected mode memory model the segment 
register is called a selector and it selects a segment descriptor 
in a descriptor table. This descriptor contains a pointer to the 
base of the segment, the limit of the segment, and various 
protection attributes. For more information on descriptor
formats, see "Descriptors and Gates" on page 48. Figure 10 on
page 27 shows segment usage for Real mode and Protected
mode memory models.
Real Mode Memory Model: the segment register supplies the
segment base in physical memory.

Protected Mode Memory Model: the segment selector selects a
descriptor in the descriptor table, and the base and limit in that
descriptor locate the segment in physical memory.

Figure 10. Segment Usage
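The two uses of a segment register can be sketched as follows. The details used here (the segment value shifted left by four bits in Real mode, and the index, table-indicator, and privilege-level fields of a Protected mode selector) come from the general x86 architecture rather than from this section, and the function names are hypothetical.

```python
def real_mode_address(segment, offset):
    """Real mode: the 16-bit segment value times 16 is the segment base."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def decode_selector(selector):
    """Protected mode: split a 16-bit selector into its fields."""
    return {
        "index": (selector & 0xFFFF) >> 3,         # descriptor number
        "table": "LDT" if selector & 0x4 else "GDT",
        "rpl": selector & 0x3,                      # requested privilege
    }


print(hex(real_mode_address(0xB800, 0x0010)))
print(decode_selector(0x0008))
```

In Protected mode the base address itself never appears in the segment register; it is read from the descriptor that the selector's index picks out.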



Instruction Pointer 

The instruction pointer (EIP or IP) is used in conjunction with 
the code segment register (CS). The instruction pointer is 
either a 32-bit register (EIP) or a 16-bit register (IP) that keeps 
track of where the next instruction resides within memory. This 
register cannot be directly manipulated, but can be altered by 
modifying return pointers when a JMP or CALL instruction is 
used. 



Floating-Point Registers 

The floating-point execution unit in the AMD-K6 3D processor 
is designed to perform mathematical operations on non-integer 
numbers. This floating-point unit conforms to the IEEE 754 and 
854 standards and uses several registers to meet these 
standards — eight numeric floating-point registers, a status 
word register, a control word register, and a tag word register. 
The eight floating-point registers are physically 80 bits wide
and labeled FPR0-FPR7. Figure 11 shows the format of these
floating-point registers. For more information, see Chapter 9,
"Floating-Point and Multimedia Execution Units" on page 253.
See "Floating-Point Register Data Types" on page 30 for
information on allowable floating-point data types.



Bits    Field
-----   -----------
79      Sign
78-64   Exponent
63-0    Significand

Figure 11. Floating-Point Register



The 16-bit FPU status word register contains information about 
the state of the floating-point unit. Figure 12 shows the format 
of this register. 



Symbol   Description                    Bits
------   ----------------------------   -----
B        FPU Busy                       15
C3       Condition Code                 14
TOSP     Top of Stack Pointer           13-11
C2       Condition Code                 10
C1       Condition Code                 9
C0       Condition Code                 8
ES       Error Summary Status           7
SF       Stack Fault                    6
         Exception Flags
PE       Precision Error                5
UE       Underflow Error                4
OE       Overflow Error                 3
ZE       Zero Divide Error              2
DE       Denormalized Operation Error   1
IE       Invalid Operation Error        0

TOSP Information
000 = FPR0
111 = FPR7

Figure 12. FPU Status Word Register
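A status word value can be decoded field by field using the bit positions listed for Figure 12. This is a minimal sketch; the dictionary and function names are hypothetical.

```python
# Single-bit fields of the FPU status word and their bit positions.
FLAG_BITS = {
    "B": 15, "C3": 14, "C2": 10, "C1": 9, "C0": 8,
    "ES": 7, "SF": 6, "PE": 5, "UE": 4, "OE": 3,
    "ZE": 2, "DE": 1, "IE": 0,
}


def decode_status_word(sw):
    """Return every status word field as a small integer."""
    fields = {name: (sw >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["TOSP"] = (sw >> 11) & 0b111   # bits 13-11 select FPR0-FPR7
    return fields


fields = decode_status_word(0x3804)
print(fields["TOSP"], fields["ZE"])
```

In this example value, bits 13-11 are all set (top of stack is FPR7) and bit 2 reports a zero-divide error.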
The FPU control word register allows a programmer to manage
the FPU processing options. Figure 13 shows the format of this
register.

Symbol   Description                          Bits
------   ----------------------------------   -----
Y        Infinity Bit (80287 compatibility)   12
RC       Rounding Control                     11-10
PC       Precision Control                    9-8
         Exception Masks
PM       Precision                            5
UM       Underflow                            4
OM       Overflow                             3
ZM       Zero Divide                          2
DM       Denormalized Operation               1
IM       Invalid Operation                    0

Rounding Control Information
00b = Round to the nearest or even number
01b = Round down toward negative infinity
10b = Round up toward positive infinity
11b = Truncate toward zero

Precision Control Information
00b = 24 bits Single Precision Real
01b = Reserved
10b = 53 bits Double Precision Real
11b = 64 bits Extended Precision Real

Figure 13. FPU Control Word Register
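The rounding-control and precision-control encodings listed for Figure 13 lend themselves to a small lookup-based decoder. The sketch below uses hypothetical names; the sample value 037Fh is the control word commonly cited as the x87 default after initialization and should be treated as an assumption rather than something stated in this section.

```python
ROUNDING = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "truncate"}
# None marks the reserved precision encoding (01b).
PRECISION_BITS = {0b00: 24, 0b01: None, 0b10: 53, 0b11: 64}
MASK_BITS = {"PM": 5, "UM": 4, "OM": 3, "ZM": 2, "DM": 1, "IM": 0}


def decode_control_word(cw):
    """Decode rounding, precision, and exception masks from a control word."""
    return {
        "rounding": ROUNDING[(cw >> 10) & 0b11],
        "precision_bits": PRECISION_BITS[(cw >> 8) & 0b11],
        "masked": sorted(n for n, b in MASK_BITS.items() if (cw >> b) & 1),
    }


print(decode_control_word(0x037F))
```

For 037Fh the decoder reports round-to-nearest, 64-bit extended precision, and all six exceptions masked.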

The FPU tag word register contains information about the 
registers in the register stack. Figure 14 shows the format of this 
register. 



Bits    Field
-----   ----------
15-14   TAG (FPR7)
13-12   TAG (FPR6)
11-10   TAG (FPR5)
9-8     TAG (FPR4)
7-6     TAG (FPR3)
5-4     TAG (FPR2)
3-2     TAG (FPR1)
1-0     TAG (FPR0)

Tag Values
00 = Valid
01 = Zero
10 = Special
11 = Empty

Figure 14. FPU Tag Word Register
Floating-Point Register Data Types

Floating-point registers use four different types of data:
packed decimal, single-precision real, double-precision real,
and extended-precision real. Figures 15 and 16 show the
formats for these registers. For more information, see Chapter
9, "Floating-Point and Multimedia Execution Units" on page
253.

Packed Decimal (Precision: 18 Digits, 72 Bits Used, 4 Bits/Digit)

Bits    Field
-----   -------------------------------
79      Sign Bit
78-72   Ignored on Load, Zeros on Store
71-0    Packed decimal digits

Figure 15. Packed Decimal Data Register

Single-Precision Real

Bits    Field
-----   ---------------
31      S (Sign Bit)
30-23   Biased Exponent
22-0    Significand

Double-Precision Real

Bits    Field
-----   ---------------
63      S (Sign Bit)
62-52   Biased Exponent
51-0    Significand

Extended-Precision Real

Bits    Field
-----   ---------------
79      S (Sign Bit)
78-64   Biased Exponent
63      I (Integer Bit)
62-0    Significand

Figure 16. Precision Real Data Registers
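The single-precision layout above (sign in bit 31, biased exponent in bits 30-23, significand in bits 22-0) can be checked on real values. This sketch uses Python's standard struct module to obtain the 32-bit encoding; the function name is hypothetical.

```python
import struct


def single_fields(value):
    """Split a single-precision real into (sign, biased exponent, significand)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    significand = bits & 0x7FFFFF    # 23 stored fraction bits
    return sign, exponent, significand


print(single_fields(1.0))
print(single_fields(-2.5))
```

The value 1.0 encodes with a biased exponent of 127 and a zero fraction, which shows the bias directly; -2.5 sets the sign bit and stores the fraction of 1.25.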
MMX/3D Registers

The AMD-K6 3D processor implements eight 64-bit MMX/3D
registers for use by multimedia software. These registers are
mapped on the floating-point register stack. The 3D and MMX
instructions refer to these registers as mmreg0 to mmreg7.
Figure 46 on page 84 shows the format of these registers. For
more information, see "3D Register Set" on page 84 and "MMX
Register Set" on page 350 in Appendix A.

MMX Data Types 

For the MMX instructions, the MMX registers use three types of
data: packed eight-byte integer, packed quadword integer, and
packed dual doubleword integer. Figure 17 shows the format of
these data types. For more information, see "MMX Data Type
Details" on page 352 in Appendix A.

Packed Bytes Integer

Bits    63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
Field   Byte 7   Byte 6   Byte 5   Byte 4   Byte 3   Byte 2   Byte 1   Byte 0

Packed Words Integer

Bits    63-48    47-32    31-16    15-0
Field   Word 3   Word 2   Word 1   Word 0

Packed Doubleword Integer

Bits    63-32          31-0
Field   Doubleword 1   Doubleword 0

Figure 17. MMX Data Types
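All three packed formats place element 0 in the least significant bits of the 64-bit register, so one pair of helper functions covers bytes, words, and doublewords. The sketch below models an MMX register as a Python integer; the function names are hypothetical.

```python
def pack_lanes(values, lane_bits):
    """Pack equal-width elements into one 64-bit value, element 0 lowest."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    if len(values) != 64 // lane_bits:
        raise ValueError("wrong number of elements for this lane width")
    mask = (1 << lane_bits) - 1
    result = 0
    for i, value in enumerate(values):
        result |= (value & mask) << (i * lane_bits)
    return result


def unpack_lanes(quadword, lane_bits):
    """Split a 64-bit value back into its elements, element 0 first."""
    mask = (1 << lane_bits) - 1
    return [(quadword >> (i * lane_bits)) & mask
            for i in range(64 // lane_bits)]


print(hex(pack_lanes([0, 1, 2, 3, 4, 5, 6, 7], 8)))
print(unpack_lanes(0x0003000200010000, 16))
```

Masking each element before shifting keeps one lane from spilling into its neighbor, which mirrors how packed arithmetic treats every lane independently.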
3D Data Types

For 3D instructions, the MMX registers use packed
single-precision real data. Figure 18 shows the format of the 3D
data type. For more information, see "3D Data Type Details" on
page 85.

Packed Single-Precision Floating Point

Bits    Field
-----   ---------------
63      S (Sign Bit)
62-55   Biased Exponent
54-32   Significand
31      S (Sign Bit)
30-23   Biased Exponent
22-0    Significand

Figure 18. 3D Data Types
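A 3D operand is simply two single-precision values side by side in one 64-bit register. The sketch below builds and splits such a value with Python's standard struct module; the function names, and the choice of which argument goes in the low half, are assumptions for illustration.

```python
import struct


def pack_3d(low, high):
    """Place two single-precision reals in bits 31-0 and bits 63-32."""
    return int.from_bytes(struct.pack("<2f", low, high), "little")


def unpack_3d(quadword):
    """Recover the (low, high) pair from a 64-bit packed value."""
    return struct.unpack("<2f", quadword.to_bytes(8, "little"))


packed = pack_3d(1.0, 2.0)
print(hex(packed))
print(unpack_3d(packed))
```

The upper half holds the encoding of 2.0 (40000000h) and the lower half the encoding of 1.0 (3F800000h), matching the two sign/exponent/significand groups in the figure.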



The EFLAGS register provides for three different types of
flags: system, control, and status. The system flags provide
operating system controls, the control flag provides directional
information for string operations, and the status flags provide
information resulting from logical and arithmetic operations.
Figure 19 shows the format of this register.



Symbol   Description                 Bits
------   -------------------------   -----
ID       ID Flag                     21
VIP      Virtual Interrupt Pending   20
VIF      Virtual Interrupt Flag      19
AC       Alignment Check             18
VM       Virtual-8086 Mode           17
RF       Resume Flag                 16
NT       Nested Task                 14
IOPL     I/O Privilege Level         13-12
OF       Overflow Flag               11
DF       Direction Flag              10
IF       Interrupt Flag              9
TF       Trap Flag                   8
SF       Sign Flag                   7
ZF       Zero Flag                   6
AF       Auxiliary Flag              4
PF       Parity Flag                 2
CF       Carry Flag                  0

Figure 19. EFLAGS Register






Control Registers 



The five control registers contain system control bits and 
pointers. Figures 20 through 24 show the formats of these 
registers. 



Symbol   Description                    Bit
------   ----------------------------   ---
MCE      Machine Check Enable           6
PSE      Page Size Extensions           4
DE       Debugging Extensions           3
TSD      Time Stamp Disable             2
PVI      Protected Virtual Interrupts   1
VME      Virtual-8086 Mode Extensions   0

Figure 20. Control Register 4 (CR4)

Symbol   Description           Bits
------   -------------------   -----
         Page Directory Base   31-12
PCD      Page Cache Disable    4
PWT      Page Writethrough     3

Figure 21. Control Register 3 (CR3)

Bits 31-0: Page Fault Linear Address

Figure 22. Control Register 2 (CR2)

Figure 23. Control Register 1 (CR1)

Symbol   Description            Bit
------   --------------------   ---
PG       Paging                 31
CD       Cache Disable          30
NW       Not Writethrough       29
AM       Alignment Mask         18
WP       Write Protect          16
NE       Numeric Error          5
ET       Extension Type         4
TS       Task Switched          3
EM       Emulation              2
MP       Monitor Co-processor   1
PE       Protection Enabled     0

Figure 24. Control Register 0 (CR0)
Debug Registers

Figures 25 through 28 show the 32-bit debug registers
supported by the processor.

Symbol   Description                          Bits
------   ----------------------------------   -----
LEN3     Length of Breakpoint #3              31-30
R/W3     Type of Transaction(s) to Trap       29-28
LEN2     Length of Breakpoint #2              27-26
R/W2     Type of Transaction(s) to Trap       25-24
LEN1     Length of Breakpoint #1              23-22
R/W1     Type of Transaction(s) to Trap       21-20
LEN0     Length of Breakpoint #0              19-18
R/W0     Type of Transaction(s) to Trap       17-16
GD       General Detect Enabled               13
GE       Global Exact Breakpoint Enabled      9
LE       Local Exact Breakpoint Enabled       8
G3       Global Exact Breakpoint #3 Enabled   7
L3       Local Exact Breakpoint #3 Enabled    6
G2       Global Exact Breakpoint #2 Enabled   5
L2       Local Exact Breakpoint #2 Enabled    4
G1       Global Exact Breakpoint #1 Enabled   3
L1       Local Exact Breakpoint #1 Enabled    2
G0       Global Exact Breakpoint #0 Enabled   1
L0       Local Exact Breakpoint #0 Enabled    0

Figure 25. Debug Register DR7
Symbol   Description                        Bit
------   --------------------------------   ---
BT       Breakpoint Task Switch             15
BS       Breakpoint Single Step             14
BD       Breakpoint Debug Access Detected   13
B3       Breakpoint #3 Condition Detected   3
B2       Breakpoint #2 Condition Detected   2
B1       Breakpoint #1 Condition Detected   1
B0       Breakpoint #0 Condition Detected   0

Figure 26. Debug Register DR6

Figure 27. Debug Registers DR5 and DR4

Register   Bits 31-0
--------   ----------------------------------
DR3        Breakpoint 3 32-bit Linear Address
DR2        Breakpoint 2 32-bit Linear Address
DR1        Breakpoint 1 32-bit Linear Address
DR0        Breakpoint 0 32-bit Linear Address

Figure 28. Debug Registers DR3, DR2, DR1, and DR0
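The DR7 field positions shown in Figure 25 can be combined into a helper that enables one breakpoint. The field positions come from the figure; the R/W and LEN encodings used below (00b execute, 01b write, 11b read or write; 00b one byte, 01b two bytes, 11b four bytes) come from the general x86 architecture and are not listed in this section, so treat them as assumptions. The function name is hypothetical.

```python
LEN_ENCODING = {1: 0b00, 2: 0b01, 4: 0b11}   # bytes -> LENn field (assumed)
RW_ENCODING = {"execute": 0b00, "write": 0b01, "read_write": 0b11}  # assumed


def dr7_enable(dr7, n, kind, length, local=True):
    """Return dr7 with breakpoint n (0-3) configured and enabled."""
    if not 0 <= n <= 3:
        raise ValueError("breakpoint number must be 0-3")
    if kind == "execute" and length != 1:
        raise ValueError("execute breakpoints use a length of 1")
    shift = 16 + 4 * n                       # R/Wn is low, LENn is high
    dr7 &= ~(0xF << shift) & 0xFFFFFFFF      # clear the old R/Wn and LENn
    dr7 |= (RW_ENCODING[kind] | (LEN_ENCODING[length] << 2)) << shift
    dr7 |= 1 << (2 * n + (0 if local else 1))  # Ln is bit 2n, Gn is 2n+1
    return dr7


print(hex(dr7_enable(0, 0, "write", 4)))
```

The linear address to watch goes in the matching DR0-DR3 register; DR7 only describes how and whether that address is monitored.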
Model-Specific Registers (MSR)

The AMD-K6 3D processor provides seven MSRs. The value in
the ECX register selects the MSR to be addressed by the
RDMSR and WRMSR instructions. The values in EAX and EDX
are used as inputs and outputs by the RDMSR and WRMSR
instructions. Table 5 lists the MSRs and the corresponding
value of the ECX register. Figures 29 through 35 show the MSR
formats.

Table 5. Model-Specific Registers (MSRs)

Model-Specific Register                         Value of ECX
---------------------------------------------   ------------
Machine Check Address Register (MCAR)           00h
Machine Check Type Register (MCTR)              01h
Test Register 12 (TR12)                         0Eh
Time Stamp Counter (TSC)                        10h
Extended Feature Enable Register (EFER)         C000_0080h
SYSCALL/SYSRET Target Address Register (STAR)   C000_0081h
Write Handling Control Register (WHCR)          C000_0082h

For more information about the RDMSR and WRMSR
instructions, see the AMD K86™ Family BIOS and Software Tools
Development Guide, document number 21062.

MCAR and MCTR

The processor does not support the generation of a machine
check exception. However, the processor does provide a 64-bit
machine check address register (MCAR), a 64-bit machine
check type register (MCTR), and a machine check enable
(MCE) bit in CR4. Because the processor does not support
machine check exceptions, the contents of the MCAR and
MCTR are only affected by the WRMSR instruction and by
RESET being sampled asserted (where all bits in each register
are reset to 0).
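The register usage of RDMSR and WRMSR can be summarized in a short sketch. The ECX values are those in Table 5. The convention that EDX carries the upper 32 bits and EAX the lower 32 bits of the 64-bit MSR value comes from the general x86 architecture rather than from this section, and the names below are hypothetical.

```python
# ECX values from Table 5.
MSR_ADDRESSES = {
    "MCAR": 0x00,
    "MCTR": 0x01,
    "TR12": 0x0E,
    "TSC": 0x10,
    "EFER": 0xC0000080,
    "STAR": 0xC0000081,
    "WHCR": 0xC0000082,
}


def split_msr_value(value):
    """Return (edx, eax) for a 64-bit value to be written with WRMSR."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


def join_msr_value(edx, eax):
    """Rebuild the 64-bit MSR value from what RDMSR leaves in EDX:EAX."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


edx, eax = split_msr_value(0x0000012345678ABC)
print(hex(MSR_ADDRESSES["TSC"]), hex(edx), hex(eax))
```

This only models the data flow; executing RDMSR or WRMSR itself requires privileged code running on the processor.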



Bits 63-0: MCAR

Figure 29. Machine-Check Address Register (MCAR)

Bits 4-0: MCTR

Figure 30. Machine-Check Type Register (MCTR)

Test Register 12 (TR12)

Test register 12 provides a method for disabling the L1 caches.
Figure 31 shows the format of TR12.

Symbol   Description         Bit
------   -----------------   ---
CI       Cache Inhibit Bit   3

Figure 31. Test Register 12 (TR12)

Time Stamp Counter

With each processor clock cycle, the processor increments the
64-bit time stamp counter (TSC) MSR. Figure 32 shows the
format of the TSC.

Bits 63-0: TSC

Figure 32. Time Stamp Counter (TSC)
Extended Feature Enable Register (EFER)

The extended feature enable register (EFER) contains the
control bits that enable the extended features of the AMD-K6
3D processor. Figure 33 shows the format of the EFER register,
and Table 6 defines the function of each bit in the EFER
register.

Symbol   Description                    Bit
------   ----------------------------   ---
SCE      System Call/Return Extension   0

Figure 33. Extended Feature Enable Register (EFER)

Table 6. Extended Feature Enable Register (EFER) Definition

Bit    Description                   R/W
----   ---------------------------   ---
63-1   Reserved                      R
0      System Call Extension (SCE)   R/W

SYSCALL/SYSRET Target Address Register (STAR)

The SYSCALL/SYSRET target address register (STAR)
contains the target EIP address used by the SYSCALL
instruction and the 16-bit code and stack segment selector
bases used by the SYSCALL and SYSRET instructions. Figure
34 shows the format of the STAR register, and Table 7 on
page 42 defines the function of each bit of the STAR register.
For more information, see the SYSCALL and SYSRET Instruction
Specification Application Note, document number 21086.
Bits    Field
-----   ----------------------------------------
63-48   SYSRET CS Selector and SS Selector Base
47-32   SYSCALL CS Selector and SS Selector Base
31-0    Target EIP Address

Figure 34. SYSCALL/SYSRET Target Address Register (STAR)

Table 7. SYSCALL/SYSRET Target Address Register (STAR) Definition

Bit     Description                       R/W
-----   -------------------------------   ---
63-48   SYSRET CS and SS Selector Base
47-32   SYSCALL CS and SS Selector Base
31-0    Target EIP Address
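The three STAR fields can be assembled into a single 64-bit value with shifts, following the bit ranges in Table 7. This is a minimal sketch: the function names are hypothetical, and the selector and entry-point values in the example are made up for illustration.

```python
def pack_star(sysret_base, syscall_base, target_eip):
    """Build the 64-bit STAR value from its three fields."""
    return (((sysret_base & 0xFFFF) << 48)     # bits 63-48
            | ((syscall_base & 0xFFFF) << 32)  # bits 47-32
            | (target_eip & 0xFFFFFFFF))       # bits 31-0


def unpack_star(star):
    """Split a STAR value into (sysret_base, syscall_base, target_eip)."""
    return (star >> 48) & 0xFFFF, (star >> 32) & 0xFFFF, star & 0xFFFFFFFF


# Hypothetical selector bases and entry point.
star = pack_star(0x001B, 0x0008, 0x80100000)
print(hex(star))
```

The upper 32 bits of the result would travel in EDX and the lower 32 bits in EAX when the value is written with WRMSR.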





Write Handling Control Register (WHCR)

The write handling control register (WHCR) is an MSR that
contains three fields: the WCDE bit, the write allocate enable
limit (WAELIM) field, and the write allocate enable
15-to-16-Mbyte (WAE15M) bit. Figure 35 shows the format of
WHCR. See "Write Allocate" on page 240 for more information.

Symbol   Description                            Bits
------   ------------------------------------   ----
WCDE     Always program to 0                    8
WAELIM   Write Allocate Enable Limit            7-1
WAE15M   Write Allocate Enable 15-to-16-Mbyte   0

Note: Hardware RESET initializes this MSR to all zeros.

Figure 35. Write Handling Control Register (WHCR)

Memory Management Registers 

The AMD-K6 3D processor controls segmented memory 
management with the registers listed in Table 8. Figure 36 on 
page 43 shows the formats of these registers. 



Table 8. Memory Management Registers

Register Name                         Function
-----------------------------------   ---------------------------------------------------------------------
Global Descriptor Table Register      Contains a pointer to the base of the global descriptor table
Interrupt Descriptor Table Register   Contains a pointer to the base of the interrupt descriptor table
Local Descriptor Table Register       Contains a pointer to the local descriptor table of the current task
Task Register                         Contains a pointer to the task state segment of the current task

Global and Interrupt Descriptor Table Registers

Bits    Field
-----   --------------------------
47-16   32-Bit Linear Base Address
15-0    16-Bit Limit

Local Descriptor Table Register and Task Register

Fields: Selector (bits 15-0), 32-Bit Linear Base Address,
32-Bit Limit, and Attributes.

Figure 36. Memory Management Registers
Task State Segment

Figure 37 shows the format of the task state segment (TSS).

The entries are listed from the TSS limit (from TR) down to the
base of the TSS. Each row is 32 bits wide (bits 31-0).

Bits 31-16             Bits 15-0
--------------------   -------------------------
I/O Permission Bitmap (IOPB) (up to 8 Kbytes)
Interrupt Redirection Bitmap (IRB) (eight 32-bit locations)
Operating System Data Structure
Base Address of IOPB   0000h, T
0000h                  LDT Selector
0000h                  GS
0000h                  FS
0000h                  DS
0000h                  SS
0000h                  CS
0000h                  ES
EDI
ESI
EBP
ESP
EBX
EDX
ECX
EAX
EFLAGS
EIP
CR3
0000h                  SS2
ESP2
0000h                  SS1
ESP1
0000h                  SS0
ESP0
0000h                  Link (Prior TSS Selector)

Figure 37. Task State Segment (TSS)
Paging

The AMD-K6 3D processor can physically address up to four 
Gbytes of memory. This memory can be segmented into pages. 
The size of these pages is determined by the operating system 
design and the values set up in the page directory entries (PDE) 
and page table entries (PTE). The processor can access both 
4-Kbyte pages and 4-Mbyte pages, and the page sizes can be 
intermixed within a page directory. When the page size 
extension (PSE) bit in CR4 is set, the processor translates linear 
addresses using either the 4-Kbyte translation lookaside buffer 
(TLB) or the 4-Mbyte TLB, depending on the state of the page 
size (PS) bit in the page directory entry. Figures 38 and 39 show 
how 4-Kbyte and 4-Mbyte page translations work. 
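
As a rough illustration of the two translation modes, the sketch below splits a 32-bit linear address into the index and offset fields used for 4-Kbyte and 4-Mbyte pages. It assumes the standard x86 field boundaries (bits 31-22, 21-12, and 11-0); the function name `split_linear_address` is hypothetical.

```python
def split_linear_address(linear, large_page=False):
    """Split a 32-bit linear address into paging fields.

    large_page=False models a 4-Kbyte page (PS = 0 in the PDE);
    large_page=True models a 4-Mbyte page (PS = 1 with CR4.PSE set).
    """
    if not 0 <= linear <= 0xFFFFFFFF:
        raise ValueError("linear address must fit in 32 bits")
    directory_index = linear >> 22          # bits 31-22 select the PDE
    if large_page:
        # Bits 21-0 are the offset inside the 4-Mbyte page.
        return {"directory_index": directory_index,
                "page_offset": linear & 0x3FFFFF}
    return {"directory_index": directory_index,
            "table_index": (linear >> 12) & 0x3FF,   # bits 21-12 select the PTE
            "page_offset": linear & 0xFFF}           # bits 11-0 inside the page
```

The same linear address therefore yields a 12-bit offset in the 4-Kbyte case and a 22-bit offset in the 4-Mbyte case, which is why a single page directory can mix both page sizes.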



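[Figure: CR3 locates the page directory. Linear address bits 31-22 (Page
Directory Offset) select the PDE, which locates the page table. Bits 21-12
(Page Table Offset) select the entry that locates the 4-Kbyte page frame.
Bits 11-0 (Page Offset) select the physical address within the frame.]

Figure 38. 4-Kbyte Paging Mechanism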
[Figure: the linear address is divided into a Page Directory Offset field
and a Page Offset field.]

Figure 39. 4-Mbyte Paging Mechanism
Figures 40 through 42 show the formats of the PDE and PTE.
These entries contain information regarding the location of 
pages and their status. 



Bits     Symbol   Description
31-12             Page Table Base Address
11-9     AVL      Available to Software
8                 Reserved
7        PS       Page Size
6                 Reserved
5        A        Accessed
4        PCD      Page Cache Disable
3        PWT      Page Writethrough
2        U/S      User/Supervisor
1        W/R      Write/Read
0        P        Present (valid)

Figure 40. Page Directory Entry 4-Kbyte Page Table (PDE)



Bits     Symbol   Description
31-22             Physical Page Base Address
21-12             Reserved
11-9     AVL      Available to Software
8                 Reserved
7        PS       Page Size
6                 Reserved
5        A        Accessed
4        PCD      Page Cache Disable
3        PWT      Page Writethrough
2        U/S      User/Supervisor
1        W/R      Write/Read
0        P        Present (valid)

Figure 41. Page Directory Entry 4-Mbyte Page Table (PDE)
Bits     Symbol   Description
31-12             Physical Page Base Address
11-9     AVL      Available to Software
8-7               Reserved
6        D        Dirty
5        A        Accessed
4        PCD      Page Cache Disable
3        PWT      Page Writethrough
2        U/S      User/Supervisor
1        W/R      Write/Read
0        P        Present (valid)

Figure 42. Page Table Entry (PTE)
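
The bit assignments in the PTE format lend themselves to a small decoder. The sketch below is illustrative only; it assumes the bit positions listed for Figure 42 (P in bit 0 through D in bit 6, AVL in bits 11-9, base address in bits 31-12), and the names `PTE_FLAG_BITS` and `decode_pte` are hypothetical.

```python
# Bit position of each single-bit PTE flag, as listed for Figure 42.
PTE_FLAG_BITS = {"P": 0, "W/R": 1, "U/S": 2, "PWT": 3, "PCD": 4, "A": 5, "D": 6}


def decode_pte(entry):
    """Decode a 32-bit page table entry into its base address and flags."""
    if not 0 <= entry <= 0xFFFFFFFF:
        raise ValueError("PTE must fit in 32 bits")
    flags = {name: bool((entry >> bit) & 1) for name, bit in PTE_FLAG_BITS.items()}
    return {
        "base": entry & 0xFFFFF000,   # physical page base address, bits 31-12
        "avl": (entry >> 9) & 0x7,    # bits 11-9, available to software
        "flags": flags,
    }
```

A page directory entry could be decoded the same way by replacing the D flag with the PS flag in bit 7.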

Descriptors and Gates 

There are various types of structures and registers in the x86 
architecture that define, protect, and isolate code segments, 
data segments, task state segments, and gates. These structures 
are called descriptors. 

Figure 43 on page 49 shows the application segment descriptor 
format. Table 9 on page 49 contains information describing the 
memory segment type to which the descriptor points. The 
application segment descriptor is used to point to either a data 
or code segment. 

Figure 44 on page 50 shows the system segment descriptor 
format. Table 10 on page 50 contains information describing 
the type of segment or gate to which the descriptor points. The 
system segment descriptor is used to point to a task state 
segment, a call gate, or a local descriptor table. 

The AMD-K6 3D processor uses gates to transfer control 
between executable segments with different privilege levels. 
Figure 45 on page 51 shows the format of the gate descriptor 
types. Table 10 contains information describing the type of 
segment or gate to which the descriptor points. 
Symbol   Description                  Bits
G        Granularity                  23
D        32-Bit/16-Bit                22
AVL      Available to Software        20
P        Present/Valid Bit            15
DPL      Descriptor Privilege Level   14-13
DT       Descriptor Type              12
Type     See Table 9                  11-8

Second doubleword:

Bits     Field
31-24    Base Address 31-24
23       G
22       D
21       Reserved
20       AVL
19-16    Segment Limit 19-16
15       P
14-13    DPL
12       DT (1)
11-8     Type
7-0      Base Address 23-16

First doubleword:

Bits     Field
31-16    Base Address 15-0
15-0     Segment Limit 15-0

Figure 43. Application Segment Descriptor



Table 9. Application Segment Types

Type   Data/Code   Description
0      Data        Read-Only
1      Data        Read-Only-Accessed
2      Data        Read/Write
3      Data        Read/Write-Accessed
4      Data        Read-Only-Expand-down
5      Data        Read-Only-Expand-down, Accessed
6      Data        Read/Write-Expand-down
7      Data        Read/Write-Expand-down, Accessed
8      Code        Execute-Only
9      Code        Execute-Only-Accessed
A      Code        Execute/Read
B      Code        Execute/Read-Accessed
C      Code        Execute-Only-Conforming
D      Code        Execute-Only-Conforming, Accessed
E      Code        Execute/Read-Only-Conforming
F      Code        Execute/Read-Only-Conforming, Accessed
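
To make the application segment descriptor format and the type encodings concrete, here is a hedged sketch that decodes the two doublewords of a descriptor. It assumes the standard x86 field positions (base and limit split across both doublewords, control bits in the second doubleword); `APPLICATION_TYPES` and `decode_segment_descriptor` are hypothetical names chosen for this example.

```python
# Type field (bits 11-8) to description, following Table 9.
APPLICATION_TYPES = {
    0x0: "Read-Only", 0x1: "Read-Only-Accessed",
    0x2: "Read/Write", 0x3: "Read/Write-Accessed",
    0x4: "Read-Only-Expand-down", 0x5: "Read-Only-Expand-down, Accessed",
    0x6: "Read/Write-Expand-down", 0x7: "Read/Write-Expand-down, Accessed",
    0x8: "Execute-Only", 0x9: "Execute-Only-Accessed",
    0xA: "Execute/Read", 0xB: "Execute/Read-Accessed",
    0xC: "Execute-Only-Conforming", 0xD: "Execute-Only-Conforming, Accessed",
    0xE: "Execute/Read-Only-Conforming",
    0xF: "Execute/Read-Only-Conforming, Accessed",
}


def decode_segment_descriptor(first, second):
    """Decode an application segment descriptor.

    first  -- doubleword holding Base Address 15-0 and Segment Limit 15-0
    second -- doubleword holding the remaining base/limit bits and controls
    """
    seg_type = (second >> 8) & 0xF
    return {
        "base": (first >> 16) | ((second & 0xFF) << 16) | (second & 0xFF000000),
        "limit": (first & 0xFFFF) | (second & 0x000F0000),
        "granularity": bool((second >> 23) & 1),   # G, bit 23
        "default_32bit": bool((second >> 22) & 1), # D, bit 22
        "avl": (second >> 20) & 1,                 # AVL, bit 20
        "present": bool((second >> 15) & 1),       # P, bit 15
        "dpl": (second >> 13) & 0x3,               # DPL, bits 14-13
        "dt": (second >> 12) & 1,                  # DT, bit 12 (1 = application)
        "kind": "Code" if seg_type & 0x8 else "Data",
        "type": APPLICATION_TYPES[seg_type],
    }
```

Decoding a commonly used flat code segment, `decode_segment_descriptor(0x0000FFFF, 0x00CF9A00)`, reports a base of 0, a 20-bit limit of FFFFFh with G set, and type A (Execute/Read).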
Symbol   Description                  Bits
G        Granularity                  23
X        Not Needed                   22
AVL      Availability to Software     20
P        Present/Valid Bit            15
DPL      Descriptor Privilege Level   14-13
DT       Descriptor Type              12
Type     See Table 10                 11-8

Second doubleword:

Bits     Field
31-24    Base Address 31-24
23       G
22       X
21       Reserved
20       AVL
19-16    Segment Limit 19-16
15       P
14-13    DPL
12       DT (0)
11-8     Type
7-0      Base Address 23-16

First doubleword:

Bits     Field
31-16    Base Address 15-0
15-0     Segment Limit 15-0

Figure 44. System Segment Descriptor



Table 10. System Segment and Gate Types

Type   Description
0      Reserved
1      Available 16-bit TSS
2      LDT
3      Busy 16-bit TSS
4      16-bit Call Gate
5      Task Gate
6      16-bit Interrupt Gate
7      16-bit Trap Gate
8      Reserved
9      Available 32-bit TSS
A      Reserved
B      Busy 32-bit TSS
C      32-bit Call Gate
D      Reserved
E      32-bit Interrupt Gate
F      32-bit Trap Gate
Symbol   Description                  Bits
P        Present/Valid Bit            15
DPL      Descriptor Privilege Level   14-13
DT       Descriptor Type              12
Type     See Table 10                 11-8

Labeled fields of the second doubleword:

Bits     Field
31-16    Offset 31-16
15       P
14-13    DPL
12       DT
11-8     Type

First doubleword:

Bits     Field
31-16    Segment Selector
15-0     Offset 15-0

Figure 45. Gate Descriptor
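
A gate descriptor can be decoded in the same spirit. The sketch below assumes the field positions given for Figure 45 and the type encodings of Table 10; `GATE_AND_SYSTEM_TYPES` and `decode_gate_descriptor` are hypothetical names used only for illustration.

```python
# Type field (bits 11-8) to description, following Table 10.
GATE_AND_SYSTEM_TYPES = {
    0x0: "Reserved", 0x1: "Available 16-bit TSS", 0x2: "LDT",
    0x3: "Busy 16-bit TSS", 0x4: "16-bit Call Gate", 0x5: "Task Gate",
    0x6: "16-bit Interrupt Gate", 0x7: "16-bit Trap Gate",
    0x8: "Reserved", 0x9: "Available 32-bit TSS", 0xA: "Reserved",
    0xB: "Busy 32-bit TSS", 0xC: "32-bit Call Gate", 0xD: "Reserved",
    0xE: "32-bit Interrupt Gate", 0xF: "32-bit Trap Gate",
}


def decode_gate_descriptor(first, second):
    """Decode a gate descriptor.

    first  -- doubleword holding the Segment Selector and Offset 15-0
    second -- doubleword holding Offset 31-16, P, DPL, DT, and Type
    """
    return {
        "selector": first >> 16,
        "offset": (second & 0xFFFF0000) | (first & 0xFFFF),
        "present": bool((second >> 15) & 1),   # P, bit 15
        "dpl": (second >> 13) & 0x3,           # DPL, bits 14-13
        "dt": (second >> 12) & 1,              # DT, bit 12 (0 = system/gate)
        "type": GATE_AND_SYSTEM_TYPES[(second >> 8) & 0xF],
    }
```

For instance, `decode_gate_descriptor(0x00083456, 0x00128E00)` describes a present 32-bit interrupt gate that targets offset 00123456h in the segment selected by 0008h.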

Exceptions and Interrupts 

Table 11 summarizes the exceptions and interrupts. 

Table 11. Summary of Exceptions and Interrupts

Interrupt
Number     Interrupt Type            Cause
0          Divide by Zero Error      DIV, IDIV
1          Debug                     Debug trap or fault
2          Non-Maskable Interrupt    NMI signal sampled asserted
3          Breakpoint                INT 3
4          Overflow                  INTO
5          Bounds Check              BOUND
6          Invalid Opcode            Invalid instruction
7          Device Not Available      ESC and WAIT
8          Double Fault              Fault occurs while handling a fault
9          Reserved - Interrupt 13
10         Invalid TSS               Task switch to an invalid segment
11         Segment Not Present       Instruction loads a segment and present bit is 0
                                     (invalid segment)
12         Stack Segment             Stack operation causes limit violation or
                                     present bit is 0
13         General Protection        Segment related or miscellaneous invalid actions
14         Page Fault                Page protection violation or a reference to
                                     missing page
16         Floating-Point Error      Arithmetic error generated by floating-point
                                     instruction
17         Alignment Check           Data reference to an unaligned operand. (The AC
                                     flag and the AM bit of CR0 are set to 1.)
0-255      Software Interrupt        INT n
Instructions Supported by the Processor



This section documents all of the x86 instructions supported by 
the AMD-K6 3D processor. The following tables show the 
instruction mnemonic, opcode, modR/M byte, decode type, and 
RISC86 operation(s) for each instruction. Tables 12 through 15 
define the integer, floating-point, MMX, 3D instructions, and 
new instructions for the processor, respectively. 

The first column in these tables indicates the instruction 
mnemonic and operand types with the following notations: 

■ reg8 - byte integer register defined by instruction byte(s) or
bits 5, 4, and 3 of the modR/M byte

■ mreg8 - byte integer register or byte integer value in
memory defined by the modR/M byte

■ reg16/32 - word or doubleword integer register defined by
instruction byte(s) or bits 5, 4, and 3 of the modR/M byte

■ mreg16/32 - word or doubleword integer register, or word or
doubleword integer value in memory defined by the
modR/M byte

■ mem8 - byte integer value in memory

■ mem16/32 - word or doubleword integer value in memory

■ mem32/48 - doubleword or 48-bit integer value in memory

■ mem48 - 48-bit integer value in memory

■ mem64 - 64-bit value in memory

■ imm8 - 8-bit immediate value

■ imm16/32 - 16-bit or 32-bit immediate value

■ disp8 - 8-bit displacement value

■ disp16/32 - 16-bit or 32-bit displacement value

■ disp32/48 - doubleword or 48-bit displacement value

■ eXX - register width depending on the operand size

■ mem32real - 32-bit floating-point value in memory

■ mem64real - 64-bit floating-point value in memory

■ mem80real - 80-bit floating-point value in memory

■ mmreg - MMX/3D register

■ mmreg1 - MMX/3D register defined by bits 5, 4, and 3 of the
modR/M byte

■ mmreg2 - MMX/3D register defined by bits 2, 1, and 0 of the
modR/M byte

The second and third columns list all applicable opcode bytes.

The fourth column lists the modR/M byte when used by the 
instruction. The modR/M byte defines the instruction as a 
register or memory form. If modR/M bits 7 and 6 are documented 
as mm (memory form), mm can only be 10b, 01b or 00b. 
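
The modR/M patterns used in the tables (for example 11-010-xxx or mm-xxx-xxx) can be checked mechanically. The following sketch assumes the standard x86 modR/M layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0); the function names `split_modrm` and `matches_pattern` are hypothetical.

```python
def split_modrm(byte):
    """Return the (mod, reg, rm) fields of a modR/M byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("modR/M must be a single byte")
    return byte >> 6, (byte >> 3) & 0x7, byte & 0x7


def matches_pattern(pattern, byte):
    """Check a modR/M byte against a table pattern such as '11-010-xxx'.

    '11' means register form, 'mm' means memory form (00b, 01b, or 10b),
    and 'x' digits match any value.
    """
    mod_field, reg_field, rm_field = pattern.split("-")
    mod, reg, rm = split_modrm(byte)
    if mod_field == "11" and mod != 0b11:
        return False
    if mod_field == "mm" and mod == 0b11:
        return False
    if reg_field != "xxx" and int(reg_field, 2) != reg:
        return False
    if rm_field != "xxx" and int(rm_field, 2) != rm:
        return False
    return True
```

With these helpers, the byte D0h splits into mod 11b, reg 010b, and r/m 000b, so it matches the register-form pattern 11-010-xxx but not the memory-form pattern mm-010-xxx.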

The fifth column lists the type of instruction decode: short,
long, or vector. The processor decode logic can process two
short, one long, or one vector decode per clock.
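
As a back-of-the-envelope illustration of the decode rates above, the sketch below estimates a lower bound on decode clocks for an in-order sequence of decode types. It is a deliberate simplification and an assumption of this example, not a model of the actual decoders: it pairs two adjacent short decodes in one clock and ignores every other restriction (instruction length, alignment, fetch bandwidth). The name `estimate_decode_clocks` is hypothetical.

```python
VALID_DECODE_TYPES = ("short", "long", "vector")


def estimate_decode_clocks(decode_types):
    """Idealized clock count: two adjacent shorts share a clock,
    every long or vector decode takes a clock of its own."""
    for kind in decode_types:
        if kind not in VALID_DECODE_TYPES:
            raise ValueError(f"unknown decode type: {kind!r}")
    clocks = 0
    i = 0
    while i < len(decode_types):
        pairable = (decode_types[i] == "short"
                    and i + 1 < len(decode_types)
                    and decode_types[i + 1] == "short")
        i += 2 if pairable else 1
        clocks += 1
    return clocks
```

Under this simplified model, a sequence of short, short, long, short, vector needs at least four decode clocks, because only the first two short decodes can share a clock.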

The sixth column lists the type of RISC86 operation(s) required 
for the instruction. The operation types and corresponding 
execution units are as follows: 

■ load, fload, mload - load unit

■ store, fstore, mstore - store unit

■ alu - either of the integer execution units

■ alux - integer X execution unit only

■ branch - branch condition unit

■ float - floating-point execution unit

■ mmx - MMX execution unit for multimedia software

■ 3D - 3D instructions execution unit

■ limm - load immediate, instruction control unit



Table 12. Integer Instructions

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
AAA                                      37h    -       -            vector
AAD                                      D5h    0Ah     -            vector
AAM                                      D4h    0Ah     -            vector
AAS                                      3Fh    -       -            vector
ADC mreg8, reg8                          10h    -       11-xxx-xxx   short   alux
ADC mem8, reg8                           10h    -       mm-xxx-xxx   long    load, alux, store
ADC mreg16/32, reg16/32                  11h    -       11-xxx-xxx   short   alu
ADC mem16/32, reg16/32                   11h    -       mm-xxx-xxx   long    load, alu, store
ADC reg8, mreg8                          12h    -       11-xxx-xxx   short   alux
ADC reg8, mem8                           12h    -       mm-xxx-xxx   short   load, alux
ADC reg16/32, mreg16/32                  13h    -       11-xxx-xxx   short   alu
ADC reg16/32, mem16/32                   13h    -       mm-xxx-xxx   short   load, alu
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
ADC AL, imm8                             14h    -       xx-xxx-xxx   short   alux
ADC EAX, imm16/32                        15h    -       xx-xxx-xxx   short   alu
ADC mreg8, imm8                          80h    -       11-010-xxx   short   alux
ADC mem8, imm8                           80h    -       mm-010-xxx   long    load, alux, store
ADC mreg16/32, imm16/32                  81h    -       11-010-xxx   short   alu
ADC mem16/32, imm16/32                   81h    -       mm-010-xxx   long    load, alu, store
ADC mreg16/32, imm8 (signed ext.)        83h    -       11-010-xxx   short   alux
ADC mem16/32, imm8 (signed ext.)         83h    -       mm-010-xxx   long    load, alux, store
ADD mreg8, reg8                          00h    -       11-xxx-xxx   short   alux
ADD mem8, reg8                           00h    -       mm-xxx-xxx   long    load, alux, store
ADD mreg16/32, reg16/32                  01h    -       11-xxx-xxx   short   alu
ADD mem16/32, reg16/32                   01h    -       mm-xxx-xxx   long    load, alu, store
ADD reg8, mreg8                          02h    -       11-xxx-xxx   short   alux
ADD reg8, mem8                           02h    -       mm-xxx-xxx   short   load, alux
ADD reg16/32, mreg16/32                  03h    -       11-xxx-xxx   short   alu
ADD reg16/32, mem16/32                   03h    -       mm-xxx-xxx   short   load, alu
ADD AL, imm8                             04h    -       xx-xxx-xxx   short   alux
ADD EAX, imm16/32                        05h    -       xx-xxx-xxx   short   alu
ADD mreg8, imm8                          80h    -       11-000-xxx   short   alux
ADD mem8, imm8                           80h    -       mm-000-xxx   long    load, alux, store
ADD mreg16/32, imm16/32                  81h    -       11-000-xxx   short   alu
ADD mem16/32, imm16/32                   81h    -       mm-000-xxx   long    load, alu, store
ADD mreg16/32, imm8 (signed ext.)        83h    -       11-000-xxx   short   alux
ADD mem16/32, imm8 (signed ext.)         83h    -       mm-000-xxx   long    load, alux, store
AND mreg8, reg8                          20h    -       11-xxx-xxx   short   alux
AND mem8, reg8                           20h    -       mm-xxx-xxx   long    load, alux, store
AND mreg16/32, reg16/32                  21h    -       11-xxx-xxx   short   alu
AND mem16/32, reg16/32                   21h    -       mm-xxx-xxx   long    load, alu, store
AND reg8, mreg8                          22h    -       11-xxx-xxx   short   alux
AND reg8, mem8                           22h    -       mm-xxx-xxx   short   load, alux
AND reg16/32, mreg16/32                  23h    -       11-xxx-xxx   short   alu
AND reg16/32, mem16/32                   23h    -       mm-xxx-xxx   short   load, alu
AND AL, imm8                             24h    -       xx-xxx-xxx   short   alux
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
AND EAX, imm16/32                        25h    -       xx-xxx-xxx   short   alu
AND mreg8, imm8                          80h    -       11-100-xxx   short   alux
AND mem8, imm8                           80h    -       mm-100-xxx   long    load, alux, store
AND mreg16/32, imm16/32                  81h    -       11-100-xxx   short   alu
AND mem16/32, imm16/32                   81h    -       mm-100-xxx   long    load, alu, store
AND mreg16/32, imm8 (signed ext.)        83h    -       11-100-xxx   short   alux
AND mem16/32, imm8 (signed ext.)         83h    -       mm-100-xxx   long    load, alux, store
ARPL mreg16, reg16                       63h    -       11-xxx-xxx   vector
ARPL mem16, reg16                        63h    -       mm-xxx-xxx   vector
BOUND                                    62h    -       xx-xxx-xxx   vector
BSF reg16/32, mreg16/32                  0Fh    BCh     11-xxx-xxx   vector
BSF reg16/32, mem16/32                   0Fh    BCh     mm-xxx-xxx   vector
BSR reg16/32, mreg16/32                  0Fh    BDh     11-xxx-xxx   vector
BSR reg16/32, mem16/32                   0Fh    BDh     mm-xxx-xxx   vector
BSWAP EAX                                0Fh    C8h     -            long    alu
BSWAP ECX                                0Fh    C9h     -            long    alu
BSWAP EDX                                0Fh    CAh     -            long    alu
BSWAP EBX                                0Fh    CBh     -            long    alu
BSWAP ESP                                0Fh    CCh     -            long    alu
BSWAP EBP                                0Fh    CDh     -            long    alu
BSWAP ESI                                0Fh    CEh     -            long    alu
BSWAP EDI                                0Fh    CFh     -            long    alu
BT mreg16/32, reg16/32                   0Fh    A3h     11-xxx-xxx   vector
BT mem16/32, reg16/32                    0Fh    A3h     mm-xxx-xxx   vector
BT mreg16/32, imm8                       0Fh    BAh     11-100-xxx   vector
BT mem16/32, imm8                        0Fh    BAh     mm-100-xxx   vector
BTC mreg16/32, reg16/32                  0Fh    BBh     11-xxx-xxx   vector
BTC mem16/32, reg16/32                   0Fh    BBh     mm-xxx-xxx   vector
BTC mreg16/32, imm8                      0Fh    BAh     11-111-xxx   vector
BTC mem16/32, imm8                       0Fh    BAh     mm-111-xxx   vector
BTR mreg16/32, reg16/32                  0Fh    B3h     11-xxx-xxx   vector
BTR mem16/32, reg16/32                   0Fh    B3h     mm-xxx-xxx   vector
BTR mreg16/32, imm8                      0Fh    BAh     11-110-xxx   vector
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
BTR mem16/32, imm8                       0Fh    BAh     mm-110-xxx   vector
BTS mreg16/32, reg16/32                  0Fh    ABh     11-xxx-xxx   vector
BTS mem16/32, reg16/32                   0Fh    ABh     mm-xxx-xxx   vector
BTS mreg16/32, imm8                      0Fh    BAh     11-101-xxx   vector
BTS mem16/32, imm8                       0Fh    BAh     mm-101-xxx   vector
CALL full pointer                        9Ah    -       -            vector
CALL near imm16/32                       E8h    -       -            short   store
CALL mem16:16/32                         FFh    -       11-011-xxx   vector
CALL near mreg32 (indirect)              FFh    -       11-010-xxx   vector
CALL near mem32 (indirect)               FFh    -       mm-010-xxx   vector
CBW/CWDE EAX                             98h    -       -            vector
CLC                                      F8h    -       -            vector
CLD                                      FCh    -       -            vector
CLI                                      FAh    -       -            vector
CLTS                                     0Fh    06h     -            vector
CMC                                      F5h    -       -            vector
CMP mreg8, reg8                          38h    -       11-xxx-xxx   short   alux
CMP mem8, reg8                           38h    -       mm-xxx-xxx   short   load, alux
CMP mreg16/32, reg16/32                  39h    -       11-xxx-xxx   short   alu
CMP mem16/32, reg16/32                   39h    -       mm-xxx-xxx   short   load, alu
CMP reg8, mreg8                          3Ah    -       11-xxx-xxx   short   alux
CMP reg8, mem8                           3Ah    -       mm-xxx-xxx   short   load, alux
CMP reg16/32, mreg16/32                  3Bh    -       11-xxx-xxx   short   alu
CMP reg16/32, mem16/32                   3Bh    -       mm-xxx-xxx   short   load, alu
CMP AL, imm8                             3Ch    -       xx-xxx-xxx   short   alux
CMP EAX, imm16/32                        3Dh    -       xx-xxx-xxx   short   alu
CMP mreg8, imm8                          80h    -       11-111-xxx   short   alux
CMP mem8, imm8                           80h    -       mm-111-xxx   short   load, alux
CMP mreg16/32, imm16/32                  81h    -       11-111-xxx   short   alu
CMP mem16/32, imm16/32                   81h    -       mm-111-xxx   short   load, alu
CMP mreg16/32, imm8 (signed ext.)        83h    -       11-111-xxx   long    load, alu
CMP mem16/32, imm8 (signed ext.)         83h    -       mm-111-xxx   long    load, alu
CMPSB mem8, mem8                         A6h    -       -            vector
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
CMPSW mem16, mem16                       A7h    -       -            vector
CMPSD mem32, mem32                       A7h    -       -            vector
CMPXCHG mreg8, reg8                      0Fh    B0h     11-xxx-xxx   vector
CMPXCHG mem8, reg8                       0Fh    B0h     mm-xxx-xxx   vector
CMPXCHG mreg16/32, reg16/32              0Fh    B1h     11-xxx-xxx   vector
CMPXCHG mem16/32, reg16/32               0Fh    B1h     mm-xxx-xxx   vector
CMPXCHG8B EDX:EAX                        0Fh    C7h     11-xxx-xxx   vector
CMPXCHG8B mem64                          0Fh    C7h     mm-xxx-xxx   vector
CPUID                                    0Fh    A2h     -            vector
CWD/CDQ EDX, EAX                         99h    -       -            vector
DAA                                      27h    -       -            vector
DAS                                      2Fh    -       -            vector
DEC EAX                                  48h    -       -            short   alu
DEC ECX                                  49h    -       -            short   alu
DEC EDX                                  4Ah    -       -            short   alu
DEC EBX                                  4Bh    -       -            short   alu
DEC ESP                                  4Ch    -       -            short   alu
DEC EBP                                  4Dh    -       -            short   alu
DEC ESI                                  4Eh    -       -            short   alu
DEC EDI                                  4Fh    -       -            short   alu
DEC mreg8                                FEh    -       11-001-xxx   vector
DEC mem8                                 FEh    -       mm-001-xxx   long    load, alux, store
DEC mreg16/32                            FFh    -       11-001-xxx   vector
DEC mem16/32                             FFh    -       mm-001-xxx   long    load, alu, store
DIV AL, mreg8                            F6h    -       11-110-xxx   vector
DIV AL, mem8                             F6h    -       mm-110-xxx   vector
DIV EAX, mreg16/32                       F7h    -       11-110-xxx   vector
DIV EAX, mem16/32                        F7h    -       mm-110-xxx   vector
IDIV mreg8                               F6h    -       11-111-xxx   vector
IDIV mem8                                F6h    -       mm-111-xxx   vector
IDIV EAX, mreg16/32                      F7h    -       11-111-xxx   vector
IDIV EAX, mem16/32                       F7h    -       mm-111-xxx   vector
IMUL reg16/32, imm16/32                  69h    -       11-xxx-xxx   vector
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
IMUL reg16/32, mreg16/32, imm16/32       69h    -       11-xxx-xxx   vector
IMUL reg16/32, mem16/32, imm16/32        69h    -       mm-xxx-xxx   vector
IMUL reg16/32, imm8 (sign extended)      6Bh    -       11-xxx-xxx   vector
IMUL reg16/32, mreg16/32, imm8 (signed)  6Bh    -       11-xxx-xxx   vector
IMUL reg16/32, mem16/32, imm8 (signed)   6Bh    -       mm-xxx-xxx   vector
IMUL AX, AL, mreg8                       F6h    -       11-101-xxx   vector
IMUL AX, AL, mem8                        F6h    -       mm-101-xxx   vector
IMUL EDX:EAX, EAX, mreg16/32             F7h    -       11-101-xxx   vector
IMUL EDX:EAX, EAX, mem16/32              F7h    -       mm-101-xxx   vector
IMUL reg16/32, mreg16/32                 0Fh    AFh     11-xxx-xxx   vector
IMUL reg16/32, mem16/32                  0Fh    AFh     mm-xxx-xxx   vector
INC EAX                                  40h    -       -            short   alu
INC ECX                                  41h    -       -            short   alu
INC EDX                                  42h    -       -            short   alu
INC EBX                                  43h    -       -            short   alu
INC ESP                                  44h    -       -            short   alu
INC EBP                                  45h    -       -            short   alu
INC ESI                                  46h    -       -            short   alu
INC EDI                                  47h    -       -            short   alu
INC mreg8                                FEh    -       11-000-xxx   vector
INC mem8                                 FEh    -       mm-000-xxx   long    load, alux, store
INC mreg16/32                            FFh    -       11-000-xxx   vector
INC mem16/32                             FFh    -       mm-000-xxx   long    load, alu, store
INVD                                     0Fh    08h     -            vector
INVLPG                                   0Fh    01h     mm-111-xxx   vector
JO short disp8                           70h    -       -            short   branch
JB/JNAE short disp8                      72h    -       -            short   branch
JNO short disp8                          71h    -       -            short   branch
JNB/JAE short disp8                      73h    -       -            short   branch
JZ/JE short disp8                        74h    -       -            short   branch
JNZ/JNE short disp8                      75h    -       -            short   branch
JBE/JNA short disp8                      76h    -       -            short   branch
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
JNBE/JA short disp8                      77h    -       -            short   branch
JS short disp8                           78h    -       -            short   branch
JNS short disp8                          79h    -       -            short   branch
JP/JPE short disp8                       7Ah    -       -            short   branch
JNP/JPO short disp8                      7Bh    -       -            short   branch
JL/JNGE short disp8                      7Ch    -       -            short   branch
JNL/JGE short disp8                      7Dh    -       -            short   branch
JLE/JNG short disp8                      7Eh    -       -            short   branch
JNLE/JG short disp8                      7Fh    -       -            short   branch
JCXZ/JECXZ short disp8                   E3h    -       -            vector
JO near disp16/32                        0Fh    80h     -            short   branch
JNO near disp16/32                       0Fh    81h     -            short   branch
JB/JNAE near disp16/32                   0Fh    82h     -            short   branch
JNB/JAE near disp16/32                   0Fh    83h     -            short   branch
JZ/JE near disp16/32                     0Fh    84h     -            short   branch
JNZ/JNE near disp16/32                   0Fh    85h     -            short   branch
JBE/JNA near disp16/32                   0Fh    86h     -            short   branch
JNBE/JA near disp16/32                   0Fh    87h     -            short   branch
JS near disp16/32                        0Fh    88h     -            short   branch
JNS near disp16/32                       0Fh    89h     -            short   branch
JP/JPE near disp16/32                    0Fh    8Ah     -            short   branch
JNP/JPO near disp16/32                   0Fh    8Bh     -            short   branch
JL/JNGE near disp16/32                   0Fh    8Ch     -            short   branch
JNL/JGE near disp16/32                   0Fh    8Dh     -            short   branch
JLE/JNG near disp16/32                   0Fh    8Eh     -            short   branch
JNLE/JG near disp16/32                   0Fh    8Fh     -            short   branch
JMP near disp16/32 (direct)              E9h    -       -            short   branch
JMP far disp32/48 (direct)               EAh    -       -            vector
JMP disp8 (short)                        EBh    -       -            short   branch
JMP far mreg32 (indirect)                FFh    -       11-101-xxx   vector
JMP far mem32 (indirect)                 FFh    -       mm-101-xxx   vector
JMP near mreg16/32 (indirect)            FFh    -       11-100-xxx   vector
JMP near mem16/32 (indirect)             FFh    -       mm-100-xxx   vector





Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
LAHF                                     9Fh    -       -            vector
LAR reg16/32, mreg16/32                  0Fh    02h     11-xxx-xxx   vector
LAR reg16/32, mem16/32                   0Fh    02h     mm-xxx-xxx   vector
LDS reg16/32, mem32/48                   C5h    -       mm-xxx-xxx   vector
LEA reg16/32, mem16/32                   8Dh    -       mm-xxx-xxx   short   load, alu
LEAVE                                    C9h    -       -            long    load, alu, alu
LES reg16/32, mem32/48                   C4h    -       mm-xxx-xxx   vector
LFS reg16/32, mem32/48                   0Fh    B4h     -            vector
LGDT mem48                               0Fh    01h     mm-010-xxx   vector
LGS reg16/32, mem32/48                   0Fh    B5h     -            vector
LIDT mem48                               0Fh    01h     mm-011-xxx   vector
LLDT mreg16                              0Fh    00h     11-010-xxx   vector
LLDT mem16                               0Fh    00h     mm-010-xxx   vector
LMSW mreg16                              0Fh    01h     11-110-xxx   vector
LMSW mem16                               0Fh    01h     mm-110-xxx   vector
LODSB AL, mem8                           ACh    -       -            long    load, alux
LODSW AX, mem16                          ADh    -       -            long    load, alu
LODSD EAX, mem32                         ADh    -       -            long    load, alu
LOOP disp8                               E2h    -       -            short   alu, branch
LOOPE/LOOPZ disp8                        E1h    -       -            vector
LOOPNE/LOOPNZ disp8                      E0h    -       -            vector
LSL reg16/32, mreg16/32                  0Fh    03h     11-xxx-xxx   vector
LSL reg16/32, mem16/32                   0Fh    03h     mm-xxx-xxx   vector
LSS reg16/32, mem32/48                   0Fh    B2h     mm-xxx-xxx   vector
LTR mreg16                               0Fh    00h     11-011-xxx   vector
LTR mem16                                0Fh    00h     mm-011-xxx   vector
MOV mreg8, reg8                          88h    -       11-xxx-xxx   short   alux
MOV mem8, reg8                           88h    -       mm-xxx-xxx   short   store
MOV mreg16/32, reg16/32                  89h    -       11-xxx-xxx   short   alu
MOV mem16/32, reg16/32                   89h    -       mm-xxx-xxx   short   store
MOV reg8, mreg8                          8Ah    -       11-xxx-xxx   short   alux
MOV reg8, mem8                           8Ah    -       mm-xxx-xxx   short   load
MOV reg16/32, mreg16/32                  8Bh    -       11-xxx-xxx   short   alu
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
MOV reg16/32, mem16/32                   8Bh    -       mm-xxx-xxx   short   load
MOV mreg16, segment reg                  8Ch    -       11-xxx-xxx   long    load
MOV mem16, segment reg                   8Ch    -       mm-xxx-xxx   vector
MOV segment reg, mreg16                  8Eh    -       11-xxx-xxx   vector
MOV segment reg, mem16                   8Eh    -       mm-xxx-xxx   vector
MOV AL, mem8                             A0h    -       -            short   load
MOV EAX, mem16/32                        A1h    -       -            short   load
MOV mem8, AL                             A2h    -       -            short   store
MOV mem16/32, EAX                        A3h    -       -            short   store
MOV AL, imm8                             B0h    -       -            short   limm
MOV CL, imm8                             B1h    -       -            short   limm
MOV DL, imm8                             B2h    -       -            short   limm
MOV BL, imm8                             B3h    -       -            short   limm
MOV AH, imm8                             B4h    -       -            short   limm
MOV CH, imm8                             B5h    -       -            short   limm
MOV DH, imm8                             B6h    -       -            short   limm
MOV BH, imm8                             B7h    -       -            short   limm
MOV EAX, imm16/32                        B8h    -       -            short   limm
MOV ECX, imm16/32                        B9h    -       -            short   limm
MOV EDX, imm16/32                        BAh    -       -            short   limm
MOV EBX, imm16/32                        BBh    -       -            short   limm
MOV ESP, imm16/32                        BCh    -       -            short   limm
MOV EBP, imm16/32                        BDh    -       -            short   limm
MOV ESI, imm16/32                        BEh    -       -            short   limm
MOV EDI, imm16/32                        BFh    -       -            short   limm
MOV mreg8, imm8                          C6h    -       11-000-xxx   short   limm
MOV mem8, imm8                           C6h    -       mm-000-xxx   long    store
MOV reg16/32, imm16/32                   C7h    -       11-000-xxx   short   limm
MOV mem16/32, imm16/32                   C7h    -       mm-000-xxx   long    store
MOVSB mem8, mem8                         A4h    -       -            long    load, store, alux, alux
MOVSD mem32, mem32                       A5h    -       -            long    load, store, alu, alu
MOVSW mem16, mem16                       A5h    -       -            long    load, store, alu, alu
MOVSX reg16/32, mreg8                    0Fh    BEh     11-xxx-xxx   short   alu
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
MOVSX reg16/32, mem8                     0Fh    BEh     mm-xxx-xxx   short   load, alu
MOVSX reg32, mreg16                      0Fh    BFh     11-xxx-xxx   short   alu
MOVSX reg32, mem16                       0Fh    BFh     mm-xxx-xxx   short   load, alu
MOVZX reg16/32, mreg8                    0Fh    B6h     11-xxx-xxx   short   alu
MOVZX reg16/32, mem8                     0Fh    B6h     mm-xxx-xxx   short   load, alu
MOVZX reg32, mreg16                      0Fh    B7h     11-xxx-xxx   short   alu
MOVZX reg32, mem16                       0Fh    B7h     mm-xxx-xxx   short   load, alu
MUL AL, mreg8                            F6h    -       11-100-xxx   vector
MUL AL, mem8                             F6h    -       mm-100-xxx   vector
MUL EAX, mreg16/32                       F7h    -       11-100-xxx   vector
MUL EAX, mem16/32                        F7h    -       mm-100-xxx   vector
NEG mreg8                                F6h    -       11-011-xxx   short   alux
NEG mem8                                 F6h    -       mm-011-xxx   vector
NEG mreg16/32                            F7h    -       11-011-xxx   short   alu
NEG mem16/32                             F7h    -       mm-011-xxx   vector
NOP (XCHG AX, AX)                        90h    -       -            short   limm
NOT mreg8                                F6h    -       11-010-xxx   short   alux
NOT mem8                                 F6h    -       mm-010-xxx   vector
NOT mreg16/32                            F7h    -       11-010-xxx   short   alu
NOT mem16/32                             F7h    -       mm-010-xxx   vector
OR mreg8, reg8                           08h    -       11-xxx-xxx   short   alux
OR mem8, reg8                            08h    -       mm-xxx-xxx   long    load, alux, store
OR mreg16/32, reg16/32                   09h    -       11-xxx-xxx   short   alu
OR mem16/32, reg16/32                    09h    -       mm-xxx-xxx   long    load, alu, store
OR reg8, mreg8                           0Ah    -       11-xxx-xxx   short   alux
OR reg8, mem8                            0Ah    -       mm-xxx-xxx   short   load, alux
OR reg16/32, mreg16/32                   0Bh    -       11-xxx-xxx   short   alu
OR reg16/32, mem16/32                    0Bh    -       mm-xxx-xxx   short   load, alu
OR AL, imm8                              0Ch    -       xx-xxx-xxx   short   alux
OR EAX, imm16/32                         0Dh    -       xx-xxx-xxx   short   alu
OR mreg8, imm8                           80h    -       11-001-xxx   short   alux
OR mem8, imm8                            80h    -       mm-001-xxx   long    load, alux, store
OR mreg16/32, imm16/32                   81h    -       11-001-xxx   short   alu
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
OR mem16/32, imm16/32                    81h    -       mm-001-xxx   long    load, alu, store
OR mreg16/32, imm8 (signed ext.)         83h    -       11-001-xxx   short   alux
OR mem16/32, imm8 (signed ext.)          83h    -       mm-001-xxx   long    load, alux, store
POP ES                                   07h    -       -            vector
POP SS                                   17h    -       -            vector
POP DS                                   1Fh    -       -            vector
POP FS                                   0Fh    A1h     -            vector
POP GS                                   0Fh    A9h     -            vector
POP EAX                                  58h    -       -            short   load, alu
POP ECX                                  59h    -       -            short   load, alu
POP EDX                                  5Ah    -       -            short   load, alu
POP EBX                                  5Bh    -       -            short   load, alu
POP ESP                                  5Ch    -       -            short   load, alu
POP EBP                                  5Dh    -       -            short   load, alu
POP ESI                                  5Eh    -       -            short   load, alu
POP EDI                                  5Fh    -       -            short   load, alu
POP mreg                                 8Fh    -       11-000-xxx   short   load, alu
POP mem                                  8Fh    -       mm-000-xxx   long    load, store, alu
POPA/POPAD                               61h    -       -            vector
POPF/POPFD                               9Dh    -       -            vector
PUSH ES                                  06h    -       -            long    load, store
PUSH CS                                  0Eh    -       -            vector
PUSH FS                                  0Fh    A0h     -            vector
PUSH GS                                  0Fh    A8h     -            vector
PUSH SS                                  16h    -       -            vector
PUSH DS                                  1Eh    -       -            long    load, store
PUSH EAX                                 50h    -       -            short   store
PUSH ECX                                 51h    -       -            short   store
PUSH EDX                                 52h    -       -            short   store
PUSH EBX                                 53h    -       -            short   store
PUSH ESP                                 54h    -       -            short   store
PUSH EBP                                 55h    -       -            short   store
PUSH ESI                                 56h    -       -            short   store
Table 12. Integer Instructions (continued)

Instruction Mnemonic                     First  Second  ModR/M       Decode  RISC86
                                         Byte   Byte    Byte         Type    Opcodes
PUSH EDI                                 57h    -       -            short   store
PUSH imm8                                6Ah    -       -            long    store
PUSH imm16/32                            68h    -       -            long    store
PUSH mreg16/32                           FFh    -       11-110-xxx   vector
PUSH mem16/32                            FFh    -       mm-110-xxx   long    load, store
PUSHA/PUSHAD                             60h    -       -            vector
PUSHF/PUSHFD                             9Ch    -       -            vector
RCL mreg8, imm8                          C0h    -       11-010-xxx   vector
RCL mem8, imm8                           C0h    -       mm-010-xxx   vector
RCL mreg16/32, imm8                      C1h    -       11-010-xxx   vector
RCL mem16/32, imm8                       C1h    -       mm-010-xxx   vector
RCL mreg8, 1                             D0h    -       11-010-xxx   vector
RCL mem8, 1                              D0h    -       mm-010-xxx   vector
RCL mreg16/32, 1                         D1h    -       11-010-xxx   vector
RCL mem16/32, 1                          D1h    -       mm-010-xxx   vector
RCL mreg8, CL                            D2h    -       11-010-xxx   vector
RCL mem8, CL                             D2h    -       mm-010-xxx   vector
RCL mreg16/32, CL                        D3h    -       11-010-xxx   vector
RCL mem16/32, CL                         D3h    -       mm-010-xxx   vector
RCR mreg8, imm8                          C0h    -       11-011-xxx   vector
RCR mem8, imm8                           C0h    -       mm-011-xxx   vector
RCR mreg16/32, imm8                      C1h    -       11-011-xxx   vector
RCR mem16/32, imm8                       C1h    -       mm-011-xxx   vector
RCR mreg8, 1                             D0h    -       11-011-xxx   vector
RCR mem8, 1                              D0h    -       mm-011-xxx   vector
RCR mreg16/32, 1                         D1h    -       11-011-xxx   vector
RCR mem16/32, 1                          D1h    -       mm-011-xxx   vector
RCR mreg8, CL                            D2h    -       11-011-xxx   vector
RCR mem8, CL                             D2h    -       mm-011-xxx   vector
RCR mreg16/32, CL                        D3h    -       11-011-xxx   vector
RCR mem16/32, CL                         D3h    -       mm-011-xxx   vector
RET near imm16                           C2h    -       -            vector
RET near                                 C3h    -       -            vector
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Table 12. Integer Instructions (continued) 



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| RET far imm16 | CAh |  |  | vector |  |
| RET far | CBh |  |  | vector |  |
| ROL mreg8, imm8 | C0h |  | 11-000-xxx | vector |  |
| ROL mem8, imm8 | C0h |  | mm-000-xxx | vector |  |
| ROL mreg16/32, imm8 | C1h |  | 11-000-xxx | vector |  |
| ROL mem16/32, imm8 | C1h |  | mm-000-xxx | vector |  |
| ROL mreg8, 1 | D0h |  | 11-000-xxx | vector |  |
| ROL mem8, 1 | D0h |  | mm-000-xxx | vector |  |
| ROL mreg16/32, 1 | D1h |  | 11-000-xxx | vector |  |
| ROL mem16/32, 1 | D1h |  | mm-000-xxx | vector |  |
| ROL mreg8, CL | D2h |  | 11-000-xxx | vector |  |
| ROL mem8, CL | D2h |  | mm-000-xxx | vector |  |
| ROL mreg16/32, CL | D3h |  | 11-000-xxx | vector |  |
| ROL mem16/32, CL | D3h |  | mm-000-xxx | vector |  |
| ROR mreg8, imm8 | C0h |  | 11-001-xxx | vector |  |
| ROR mem8, imm8 | C0h |  | mm-001-xxx | vector |  |
| ROR mreg16/32, imm8 | C1h |  | 11-001-xxx | vector |  |
| ROR mem16/32, imm8 | C1h |  | mm-001-xxx | vector |  |
| ROR mreg8, 1 | D0h |  | 11-001-xxx | vector |  |
| ROR mem8, 1 | D0h |  | mm-001-xxx | vector |  |
| ROR mreg16/32, 1 | D1h |  | 11-001-xxx | vector |  |
| ROR mem16/32, 1 | D1h |  | mm-001-xxx | vector |  |
| ROR mreg8, CL | D2h |  | 11-001-xxx | vector |  |
| ROR mem8, CL | D2h |  | mm-001-xxx | vector |  |
| ROR mreg16/32, CL | D3h |  | 11-001-xxx | vector |  |
| ROR mem16/32, CL | D3h |  | mm-001-xxx | vector |  |
| SAHF | 9Eh |  |  | vector |  |
| SAR mreg8, imm8 | C0h |  | 11-111-xxx | short | alux |
| SAR mem8, imm8 | C0h |  | mm-111-xxx | vector |  |
| SAR mreg16/32, imm8 | C1h |  | 11-111-xxx | short | alu |
| SAR mem16/32, imm8 | C1h |  | mm-111-xxx | vector |  |
| SAR mreg8, 1 | D0h |  | 11-111-xxx | short | alux |
| SAR mem8, 1 | D0h |  | mm-111-xxx | vector |  |

Table 12. Integer Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| SAR mreg16/32, 1 | D1h |  | 11-111-xxx | short | alu |
| SAR mem16/32, 1 | D1h |  | mm-111-xxx | vector |  |
| SAR mreg8, CL | D2h |  | 11-111-xxx | short | alux |
| SAR mem8, CL | D2h |  | mm-111-xxx | vector |  |
| SAR mreg16/32, CL | D3h |  | 11-111-xxx | short | alu |
| SAR mem16/32, CL | D3h |  | mm-111-xxx | vector |  |
| SBB mreg8, reg8 | 18h |  | 11-xxx-xxx | short | alux |
| SBB mem8, reg8 | 18h |  | mm-xxx-xxx | long | load, alux, store |
| SBB mreg16/32, reg16/32 | 19h |  | 11-xxx-xxx | short | alu |
| SBB mem16/32, reg16/32 | 19h |  | mm-xxx-xxx | long | load, alu, store |
| SBB reg8, mreg8 | 1Ah |  | 11-xxx-xxx | short | alux |
| SBB reg8, mem8 | 1Ah |  | mm-xxx-xxx | short | load, alux |
| SBB reg16/32, mreg16/32 | 1Bh |  | 11-xxx-xxx | short | alu |
| SBB reg16/32, mem16/32 | 1Bh |  | mm-xxx-xxx | short | load, alu |
| SBB AL, imm8 | 1Ch |  | xx-xxx-xxx | short | alux |
| SBB EAX, imm16/32 | 1Dh |  | xx-xxx-xxx | short | alu |
| SBB mreg8, imm8 | 80h |  | 11-011-xxx | short | alux |
| SBB mem8, imm8 | 80h |  | mm-011-xxx | long | load, alux, store |
| SBB mreg16/32, imm16/32 | 81h |  | 11-011-xxx | short | alu |
| SBB mem16/32, imm16/32 | 81h |  | mm-011-xxx | long | load, alu, store |
| SBB mreg8, imm8 (signed ext.) | 83h |  | 11-011-xxx | short | alux |
| SBB mem8, imm8 (signed ext.) | 83h |  | mm-011-xxx | long | load, alux, store |
| SCASB AL, mem8 | AEh |  |  | vector |  |
| SCASW AX, mem16 | AFh |  |  | vector |  |
| SCASD EAX, mem32 | AFh |  |  | vector |  |
| SETO mreg8 | 0Fh | 90h | 11-xxx-xxx | vector |  |
| SETO mem8 | 0Fh | 90h | mm-xxx-xxx | vector |  |
| SETNO mreg8 | 0Fh | 91h | 11-xxx-xxx | vector |  |
| SETNO mem8 | 0Fh | 91h | mm-xxx-xxx | vector |  |
| SETB/SETNAE mreg8 | 0Fh | 92h | 11-xxx-xxx | vector |  |
| SETB/SETNAE mem8 | 0Fh | 92h | mm-xxx-xxx | vector |  |
| SETNB/SETAE mreg8 | 0Fh | 93h | 11-xxx-xxx | vector |  |
| SETNB/SETAE mem8 | 0Fh | 93h | mm-xxx-xxx | vector |  |

Table 12. Integer Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| SETZ/SETE mreg8 | 0Fh | 94h | 11-xxx-xxx | vector |  |
| SETZ/SETE mem8 | 0Fh | 94h | mm-xxx-xxx | vector |  |
| SETNZ/SETNE mreg8 | 0Fh | 95h | 11-xxx-xxx | vector |  |
| SETNZ/SETNE mem8 | 0Fh | 95h | mm-xxx-xxx | vector |  |
| SETBE/SETNA mreg8 | 0Fh | 96h | 11-xxx-xxx | vector |  |
| SETBE/SETNA mem8 | 0Fh | 96h | mm-xxx-xxx | vector |  |
| SETNBE/SETA mreg8 | 0Fh | 97h | 11-xxx-xxx | vector |  |
| SETNBE/SETA mem8 | 0Fh | 97h | mm-xxx-xxx | vector |  |
| SETS mreg8 | 0Fh | 98h | 11-xxx-xxx | vector |  |
| SETS mem8 | 0Fh | 98h | mm-xxx-xxx | vector |  |
| SETNS mreg8 | 0Fh | 99h | 11-xxx-xxx | vector |  |
| SETNS mem8 | 0Fh | 99h | mm-xxx-xxx | vector |  |
| SETP/SETPE mreg8 | 0Fh | 9Ah | 11-xxx-xxx | vector |  |
| SETP/SETPE mem8 | 0Fh | 9Ah | mm-xxx-xxx | vector |  |
| SETNP/SETPO mreg8 | 0Fh | 9Bh | 11-xxx-xxx | vector |  |
| SETNP/SETPO mem8 | 0Fh | 9Bh | mm-xxx-xxx | vector |  |
| SETL/SETNGE mreg8 | 0Fh | 9Ch | 11-xxx-xxx | vector |  |
| SETL/SETNGE mem8 | 0Fh | 9Ch | mm-xxx-xxx | vector |  |
| SETNL/SETGE mreg8 | 0Fh | 9Dh | 11-xxx-xxx | vector |  |
| SETNL/SETGE mem8 | 0Fh | 9Dh | mm-xxx-xxx | vector |  |
| SETLE/SETNG mreg8 | 0Fh | 9Eh | 11-xxx-xxx | vector |  |
| SETLE/SETNG mem8 | 0Fh | 9Eh | mm-xxx-xxx | vector |  |
| SETNLE/SETG mreg8 | 0Fh | 9Fh | 11-xxx-xxx | vector |  |
| SETNLE/SETG mem8 | 0Fh | 9Fh | mm-xxx-xxx | vector |  |
| SGDT mem48 | 0Fh | 01h | mm-000-xxx | vector |  |
| SIDT mem48 | 0Fh | 01h | mm-001-xxx | vector |  |
| SHL/SAL mreg8, imm8 | C0h |  | 11-100-xxx | short | alux |
| SHL/SAL mem8, imm8 | C0h |  | mm-100-xxx | vector |  |
| SHL/SAL mreg16/32, imm8 | C1h |  | 11-100-xxx | short | alu |
| SHL/SAL mem16/32, imm8 | C1h |  | mm-100-xxx | vector |  |
| SHL/SAL mreg8, 1 | D0h |  | 11-100-xxx | short | alux |
| SHL/SAL mem8, 1 | D0h |  | mm-100-xxx | vector |  |
| SHL/SAL mreg16/32, 1 | D1h |  | 11-100-xxx | short | alu |

Table 12. Integer Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| SHL/SAL mem16/32, 1 | D1h |  | mm-100-xxx | vector |  |
| SHL/SAL mreg8, CL | D2h |  | 11-100-xxx | short | alux |
| SHL/SAL mem8, CL | D2h |  | mm-100-xxx | vector |  |
| SHL/SAL mreg16/32, CL | D3h |  | 11-100-xxx | short | alu |
| SHL/SAL mem16/32, CL | D3h |  | mm-100-xxx | vector |  |
| SHR mreg8, imm8 | C0h |  | 11-101-xxx | short | alux |
| SHR mem8, imm8 | C0h |  | mm-101-xxx | vector |  |
| SHR mreg16/32, imm8 | C1h |  | 11-101-xxx | short | alu |
| SHR mem16/32, imm8 | C1h |  | mm-101-xxx | vector |  |
| SHR mreg8, 1 | D0h |  | 11-101-xxx | short | alux |
| SHR mem8, 1 | D0h |  | mm-101-xxx | vector |  |
| SHR mreg16/32, 1 | D1h |  | 11-101-xxx | short | alu |
| SHR mem16/32, 1 | D1h |  | mm-101-xxx | vector |  |
| SHR mreg8, CL | D2h |  | 11-101-xxx | short | alux |
| SHR mem8, CL | D2h |  | mm-101-xxx | vector |  |
| SHR mreg16/32, CL | D3h |  | 11-101-xxx | short | alu |
| SHR mem16/32, CL | D3h |  | mm-101-xxx | vector |  |
| SHLD mreg16/32, reg16/32, imm8 | 0Fh | A4h | 11-xxx-xxx | vector |  |
| SHLD mem16/32, reg16/32, imm8 | 0Fh | A4h | mm-xxx-xxx | vector |  |
| SHLD mreg16/32, reg16/32, CL | 0Fh | A5h | 11-xxx-xxx | vector |  |
| SHLD mem16/32, reg16/32, CL | 0Fh | A5h | mm-xxx-xxx | vector |  |
| SHRD mreg16/32, reg16/32, imm8 | 0Fh | ACh | 11-xxx-xxx | vector |  |
| SHRD mem16/32, reg16/32, imm8 | 0Fh | ACh | mm-xxx-xxx | vector |  |
| SHRD mreg16/32, reg16/32, CL | 0Fh | ADh | 11-xxx-xxx | vector |  |
| SHRD mem16/32, reg16/32, CL | 0Fh | ADh | mm-xxx-xxx | vector |  |
| SLDT mreg16 | 0Fh | 00h | 11-000-xxx | vector |  |
| SLDT mem16 | 0Fh | 00h | mm-000-xxx | vector |  |
| SMSW mreg16 | 0Fh | 01h | 11-100-xxx | vector |  |
| SMSW mem16 | 0Fh | 01h | mm-100-xxx | vector |  |
| STC | F9h |  |  | vector |  |
| STD | FDh |  |  | vector |  |
| STI | FBh |  |  | vector |  |
| STOSB mem8, AL | AAh |  |  | long | store, alux |

Table 12. Integer Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| STOSW mem16, AX | ABh |  |  | long | store, alu |
| STOSD mem32, EAX | ABh |  |  | long | store, alu |
| STR mreg16 | 0Fh | 00h | 11-001-xxx | vector |  |
| STR mem16 | 0Fh | 00h | mm-001-xxx | vector |  |
| SUB mreg8, reg8 | 28h |  | 11-xxx-xxx | short | alux |
| SUB mem8, reg8 | 28h |  | mm-xxx-xxx | long | load, alux, store |
| SUB mreg16/32, reg16/32 | 29h |  | 11-xxx-xxx | short | alu |
| SUB mem16/32, reg16/32 | 29h |  | mm-xxx-xxx | long | load, alu, store |
| SUB reg8, mreg8 | 2Ah |  | 11-xxx-xxx | short | alux |
| SUB reg8, mem8 | 2Ah |  | mm-xxx-xxx | short | load, alux |
| SUB reg16/32, mreg16/32 | 2Bh |  | 11-xxx-xxx | short | alu |
| SUB reg16/32, mem16/32 | 2Bh |  | mm-xxx-xxx | short | load, alu |
| SUB AL, imm8 | 2Ch |  | xx-xxx-xxx | short | alux |
| SUB EAX, imm16/32 | 2Dh |  | xx-xxx-xxx | short | alu |
| SUB mreg8, imm8 | 80h |  | 11-101-xxx | short | alux |
| SUB mem8, imm8 | 80h |  | mm-101-xxx | long | load, alux, store |
| SUB mreg16/32, imm16/32 | 81h |  | 11-101-xxx | short | alu |
| SUB mem16/32, imm16/32 | 81h |  | mm-101-xxx | long | load, alu, store |
| SUB mreg16/32, imm8 (signed ext.) | 83h |  | 11-101-xxx | short | alux |
| SUB mem16/32, imm8 (signed ext.) | 83h |  | mm-101-xxx | long | load, alux, store |
| SYSCALL | 0Fh | 05h |  | vector |  |
| SYSRET | 0Fh | 07h |  | vector |  |
| TEST mreg8, reg8 | 84h |  | 11-xxx-xxx | short | alux |
| TEST mem8, reg8 | 84h |  | mm-xxx-xxx | vector |  |
| TEST mreg16/32, reg16/32 | 85h |  | 11-xxx-xxx | short | alu |
| TEST mem16/32, reg16/32 | 85h |  | mm-xxx-xxx | vector |  |
| TEST AL, imm8 | A8h |  |  | long | alux |
| TEST EAX, imm16/32 | A9h |  |  | long | alu |
| TEST mreg8, imm8 | F6h |  | 11-000-xxx | long | alux |
| TEST mem8, imm8 | F6h |  | mm-000-xxx | long | load, alux |
| TEST mreg16/32, imm16/32 | F7h |  | 11-000-xxx | long | alu |
| TEST mem16/32, imm16/32 | F7h |  | mm-000-xxx | long | load, alu |
| VERR mreg16 | 0Fh | 00h | 11-100-xxx | vector |  |

Table 12. Integer Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes |
|---|---|---|---|---|---|
| VERR mem16 | 0Fh | 00h | mm-100-xxx | vector |  |
| VERW mreg16 | 0Fh | 00h | 11-101-xxx | vector |  |
| VERW mem16 | 0Fh | 00h | mm-101-xxx | vector |  |
| WAIT | 9Bh |  |  | vector |  |
| WBINVD | 0Fh | 09h |  | vector |  |
| XADD mreg8, reg8 | 0Fh | C0h | 11-100-xxx | vector |  |
| XADD mem8, reg8 | 0Fh | C0h | mm-100-xxx | vector |  |
| XADD mreg16/32, reg16/32 | 0Fh | C1h | 11-101-xxx | vector |  |
| XADD mem16/32, reg16/32 | 0Fh | C1h | mm-101-xxx | vector |  |
| XCHG reg8, mreg8 | 86h |  | 11-xxx-xxx | vector |  |
| XCHG reg8, mem8 | 86h |  | mm-xxx-xxx | vector |  |
| XCHG reg16/32, mreg16/32 | 87h |  | 11-xxx-xxx | vector |  |
| XCHG reg16/32, mem16/32 | 87h |  | mm-xxx-xxx | vector |  |
| XCHG EAX, EAX | 90h |  |  | short | limm |
| XCHG EAX, ECX | 91h |  |  | long | alu, alu, alu |
| XCHG EAX, EDX | 92h |  |  | long | alu, alu, alu |
| XCHG EAX, EBX | 93h |  |  | long | alu, alu, alu |
| XCHG EAX, ESP | 94h |  |  | long | alu, alu, alu |
| XCHG EAX, EBP | 95h |  |  | long | alu, alu, alu |
| XCHG EAX, ESI | 96h |  |  | long | alu, alu, alu |
| XCHG EAX, EDI | 97h |  |  | long | alu, alu, alu |
| XLAT | D7h |  |  | vector |  |
| XOR mreg8, reg8 | 30h |  | 11-xxx-xxx | short | alux |
| XOR mem8, reg8 | 30h |  | mm-xxx-xxx | long | load, alux, store |
| XOR mreg16/32, reg16/32 | 31h |  | 11-xxx-xxx | short | alu |
| XOR mem16/32, reg16/32 | 31h |  | mm-xxx-xxx | long | load, alu, store |
| XOR reg8, mreg8 | 32h |  | 11-xxx-xxx | short | alux |
| XOR reg8, mem8 | 32h |  | mm-xxx-xxx | short | load, alux |
| XOR reg16/32, mreg16/32 | 33h |  | 11-xxx-xxx | short | alu |
| XOR reg16/32, mem16/32 | 33h |  | mm-xxx-xxx | short | load, alu |
| XOR AL, imm8 | 34h |  | xx-xxx-xxx | short | alux |
| XOR EAX, imm16/32 | 35h |  | xx-xxx-xxx | short | alu |
| XOR mreg8, imm8 | 80h |  | 11-110-xxx | short | alux |
| XOR mem8, imm8 | 80h |  | mm-110-xxx | long | load, alux, store |
| XOR mreg16/32, imm16/32 | 81h |  | 11-110-xxx | short | alu |
| XOR mem16/32, imm16/32 | 81h |  | mm-110-xxx | long | load, alu, store |
| XOR mreg16/32, imm8 (signed ext.) | 83h |  | 11-110-xxx | short | alux |
| XOR mem16/32, imm8 (signed ext.) | 83h |  | mm-110-xxx | long | load, alux, store |
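
The ModR/M Byte column of Table 12 uses patterns such as 11-110-xxx and mm-110-xxx. The following Python sketch shows one way a byte could be tested against such a pattern. It is an illustration only: it assumes that x marks a don't-care bit and that mm stands for any mod value other than 11 (a memory operand), and the names split_modrm and matches are hypothetical helpers, not part of any AMD tool.

```python
def split_modrm(byte: int) -> tuple[int, int, int]:
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("ModR/M must be a single byte")
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def matches(pattern: str, byte: int) -> bool:
    """Test a byte against a table pattern such as '11-110-xxx'.

    Assumed reading of the notation: 'x' is a don't-care bit and
    'mm' means any mod value except 11 (memory operand).
    """
    mod_pat, reg_pat, rm_pat = pattern.split("-")
    mod, reg, rm = split_modrm(byte)
    if mod_pat == "mm":
        if mod == 0b11:
            return False
    elif mod_pat != "xx" and int(mod_pat, 2) != mod:
        return False
    # Compare the two 3-bit fields bit by bit, skipping don't-care bits.
    for pat, value in ((reg_pat, reg), (rm_pat, rm)):
        bits = format(value, "03b")
        if any(p != "x" and p != b for p, b in zip(pat, bits)):
            return False
    return True
```

Under these assumptions, the byte F1h (11-110-001) matches the register form of XOR mreg8, imm8 (first byte 80h), while 31h (00-110-001) matches only the memory form mm-110-xxx.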




Table 13. Floating-Point Instructions

| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| F2XM1 | D9h | F0h |  | short | float |  |
| FABS | D9h | E1h |  | short | float |  |
| FADD ST(0), ST(i) | D8h |  | 11-000-xxx | short | float | * |
| FADD ST(0), mem32real | D8h |  | mm-000-xxx | short | fload, float |  |
| FADD ST(i), ST(0) | DCh |  | 11-000-xxx | short | float |  |
| FADD ST(0), mem64real | DCh |  | mm-000-xxx | short | fload, float |  |
| FADDP ST(i), ST(0) | DEh |  | 11-000-xxx | short | float | * |
| FBLD | DFh |  | mm-100-xxx | vector |  | * |
| FBSTP | DFh |  | mm-110-xxx | vector |  | * |
| FCHS | D9h | E0h |  | short | float |  |
| FCLEX | DBh | E2h |  | vector |  |  |
| FCOM ST(0), ST(i) | D8h |  | 11-010-xxx | short | float | * |
| FCOM ST(0), mem32real | D8h |  | mm-010-xxx | short | fload, float |  |
| FCOM ST(0), mem64real | DCh |  | mm-010-xxx | short | fload, float |  |
| FCOMP ST(0), ST(i) | D8h |  | 11-011-xxx | short | float | * |
| FCOMP ST(0), mem32real | D8h |  | mm-011-xxx | short | fload, float |  |
| FCOMP ST(0), mem64real | DCh |  | mm-011-xxx | short | fload, float |  |
| FCOMPP | DEh |  | 11-011-001 | short | float |  |
| FCOS ST(0) | D9h | FFh |  | short | float |  |
| FDECSTP | D9h | F6h |  | short | float |  |
| FDIV ST(0), ST(i) (single precision) | D8h |  | 11-110-xxx | short | float | * |
| FDIV ST(0), ST(i) (double precision) | D8h |  | 11-110-xxx | short | float | * |

Note:

* The last three bits of the modR/M byte select the stack entry ST(i).

Table 13. Floating-Point Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| FDIV ST(0), ST(i) (extended precision) | D8h |  | 11-110-xxx | short | float | * |
| FDIV ST(i), ST(0) (single precision) | DCh |  | 11-111-xxx | short | float | * |
| FDIV ST(i), ST(0) (double precision) | DCh |  | 11-111-xxx | short | float | * |
| FDIV ST(i), ST(0) (extended precision) | DCh |  | 11-111-xxx | short | float | * |
| FDIV ST(0), mem32real | D8h |  | mm-110-xxx | short | fload, float |  |
| FDIV ST(0), mem64real | DCh |  | mm-110-xxx | short | fload, float |  |
| FDIVP ST(0), ST(i) | DEh |  | 11-111-xxx | short | float | * |
| FDIVR ST(0), ST(i) | D8h |  | 11-110-xxx | short | float | * |
| FDIVR ST(i), ST(0) | DCh |  | 11-111-xxx | short | float | * |
| FDIVR ST(0), mem32real | D8h |  | mm-111-xxx | short | fload, float |  |
| FDIVR ST(0), mem64real | DCh |  | mm-111-xxx | short | fload, float |  |
| FDIVRP ST(i), ST(0) | DEh |  | 11-110-xxx | short | float | * |
| FFREE ST(i) | DDh |  | 11-000-xxx | short | float | * |
| FIADD ST(0), mem32int | DAh |  | mm-000-xxx | short | fload, float |  |
| FIADD ST(0), mem16int | DEh |  | mm-000-xxx | short | fload, float |  |
| FICOM ST(0), mem32int | DAh |  | mm-010-xxx | short | fload, float |  |
| FICOM ST(0), mem16int | DEh |  | mm-010-xxx | short | fload, float |  |
| FICOMP ST(0), mem32int | DAh |  | mm-011-xxx | short | fload, float |  |
| FICOMP ST(0), mem16int | DEh |  | mm-011-xxx | short | fload, float |  |
| FIDIV ST(0), mem32int | DAh |  | mm-110-xxx | short | fload, float |  |
| FIDIV ST(0), mem16int | DEh |  | mm-110-xxx | short | fload, float |  |
| FIDIVR ST(0), mem32int | DAh |  | mm-111-xxx | short | fload, float |  |
| FIDIVR ST(0), mem16int | DEh |  | mm-111-xxx | short | fload, float |  |
| FILD mem16int | DFh |  | mm-000-xxx | short | fload, float |  |
| FILD mem32int | DBh |  | mm-000-xxx | short | fload, float |  |
| FILD mem64int | DFh |  | mm-101-xxx | short | fload, float |  |
| FIMUL ST(0), mem32int | DAh |  | mm-001-xxx | short | fload, float |  |
| FIMUL ST(0), mem16int | DEh |  | mm-001-xxx | short | fload, float |  |
| FINCSTP | D9h | F7h |  | short | float |  |
| FINIT | DBh | E3h |  | vector |  |  |
| FIST mem16int | DFh |  | mm-010-xxx | short | fload, float |  |

Note:

* The last three bits of the modR/M byte select the stack entry ST(i).

Table 13. Floating-Point Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| FIST mem32int | DBh |  | mm-010-xxx | short | fload, float |  |
| FISTP mem16int | DFh |  | mm-011-xxx | short | fload, float |  |
| FISTP mem32int | DBh |  | mm-011-xxx | short | fload, float |  |
| FISTP mem64int | DFh |  | mm-111-xxx | short | fload, float |  |
| FISUB ST(0), mem32int | DAh |  | mm-100-xxx | short | fload, float |  |
| FISUB ST(0), mem16int | DEh |  | mm-100-xxx | short | fload, float |  |
| FISUBR ST(0), mem32int | DAh |  | mm-101-xxx | short | fload, float |  |
| FISUBR ST(0), mem16int | DEh |  | mm-101-xxx | short | fload, float |  |
| FLD ST(i) | D9h |  | 11-000-xxx | short | fload, float |  |
| FLD mem32real | D9h |  | mm-000-xxx | short | fload, float |  |
| FLD mem64real | DDh |  | mm-000-xxx | short | fload, float |  |
| FLD mem80real | DBh |  | mm-101-xxx | vector |  |  |
| FLD1 | D9h | E8h |  | short | fload, float |  |
| FLDCW | D9h |  | mm-101-xxx | vector |  |  |
| FLDENV | D9h |  | mm-100-xxx | short | fload, float |  |
| FLDL2E | D9h | EAh |  | short | float |  |
| FLDL2T | D9h | E9h |  | short | float |  |
| FLDLG2 | D9h | ECh |  | short | float |  |
| FLDLN2 | D9h | EDh |  | short | float |  |
| FLDPI | D9h | EBh |  | short | float |  |
| FLDZ | D9h | EEh |  | short | float |  |
| FMUL ST(0), ST(i) | D8h |  | 11-001-xxx | short | float |  |
| FMUL ST(i), ST(0) | DCh |  | 11-001-xxx | short | float |  |
| FMUL ST(0), mem32real | D8h |  | mm-001-xxx | short | fload, float |  |
| FMUL ST(0), mem64real | DCh |  | mm-001-xxx | short | fload, float |  |
| FMULP ST(0), ST(i) | DEh |  | 11-001-xxx | short | float |  |
| FNOP | D9h | D0h |  | short | float |  |
| FPATAN | D9h | F3h |  | short | float |  |
| FPREM | D9h | F8h |  | short | float |  |
| FPREM1 | D9h | F5h |  | short | float |  |
| FPTAN | D9h | F2h |  | vector |  |  |

Note:

* The last three bits of the modR/M byte select the stack entry ST(i).

Table 13. Floating-Point Instructions (continued)



| Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| FRNDINT | D9h | FCh |  | short | float |  |
| FRSTOR | DDh |  | mm-100-xxx | vector |  |  |
| FSAVE | DDh |  | mm-110-xxx | vector |  |  |
| FSCALE | D9h | FDh |  | short | float |  |
| FSIN | D9h | FEh |  | short | float |  |
| FSINCOS | D9h | FBh |  | vector |  |  |
| FSQRT (single precision) | D9h | FAh |  | short | float |  |
| FSQRT (double precision) | D9h | FAh |  | short | float |  |
| FSQRT (extended precision) | D9h | FAh |  | short | float |  |
| FST mem32real | D9h |  | mm-010-xxx | short | fstore |  |
| FST mem64real | DDh |  | mm-010-xxx | short | fstore |  |
| FST ST(i) | DDh |  | 11-010-xxx | short | fstore |  |
| FSTCW | D9h |  | mm-111-xxx | vector |  |  |
| FSTENV | D9h |  | mm-110-xxx | vector |  |  |
| FSTP mem32real | D9h |  | mm-011-xxx | short | fstore |  |
| FSTP mem64real | DDh |  | mm-011-xxx | short | fstore |  |
| FSTP mem80real | D9h |  | mm-111-xxx | vector |  |  |
| FSTP ST(i) | DDh |  | 11-011-xxx | short | float |  |
| FSTSW AX | DFh | E0h |  | vector |  |  |
| FSTSW mem16 | DDh |  | mm-111-xxx | vector |  |  |
| FSUB ST(0), mem32real | D8h |  | mm-100-xxx | short | fload, float |  |
| FSUB ST(0), mem64real | DCh |  | mm-100-xxx | short | fload, float |  |
| FSUB ST(0), ST(i) | D8h |  | 11-100-xxx | short | float |  |
| FSUB ST(i), ST(0) | DCh |  | 11-101-xxx | short | float |  |
| FSUBP ST(0), ST(i) | DEh |  | 11-101-xxx | short | float |  |
| FSUBR ST(0), mem32real | D8h |  | mm-101-xxx | short | fload, float |  |
| FSUBR ST(0), mem64real | DCh |  | mm-101-xxx | short | fload, float |  |
| FSUBR ST(0), ST(i) | D8h |  | 11-100-xxx | short | float |  |
| FSUBR ST(i), ST(0) | DCh |  | 11-101-xxx | short | float |  |
| FSUBRP ST(i), ST(0) | DEh |  | 11-100-xxx | short | float |  |
| FTST | D9h | E4h |  | short | float |  |
| FUCOM | DDh |  | 11-100-xxx | short | float |  |
| FUCOMP | DDh |  | 11-101-xxx | short | float |  |
| FUCOMPP | DAh | E9h |  | short | float |  |
| FXAM | D9h | E5h |  | short | float |  |
| FXCH | D9h |  | 11-001-xxx | short | float |  |
| FXTRACT | D9h | F4h |  | vector |  |  |
| FYL2X | D9h | F1h |  | short | float |  |
| FYL2XP1 | D9h | F9h |  | short | float |  |
| FWAIT | 9Bh |  |  | vector |  |  |

Note:

* The last three bits of the modR/M byte select the stack entry ST(i).
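
As an illustration of the note above, the following Python sketch builds the two-byte encoding of FADD ST(0), ST(i) from its row in Table 13 (first byte D8h, ModR/M 11-000-xxx) and recovers the stack entry from a ModR/M byte. The function names fadd_st0_sti and stack_entry are hypothetical and chosen only for this example.

```python
def fadd_st0_sti(i: int) -> bytes:
    """Encode FADD ST(0), ST(i): first byte D8h, ModR/M 11-000-xxx."""
    if not 0 <= i <= 7:
        raise ValueError("stack entry must be in the range 0..7")
    # mod = 11, middle field = 000, last three bits = stack entry index.
    modrm = (0b11 << 6) | (0b000 << 3) | i
    return bytes([0xD8, modrm])


def stack_entry(modrm: int) -> int:
    """Return i for ST(i): the last three bits of the ModR/M byte."""
    return modrm & 0b111
```

For instance, fadd_st0_sti(3) yields the bytes D8h C3h, and stack_entry applied to C3h gives 3 again.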



For more information about MMX instructions, see Appendix A, "MMX Multimedia 
Technology" on page 347. 



Table 14. MMX Instructions

| Instruction Mnemonic | Prefix Byte(s) | First Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| EMMS | 0Fh | 77h |  | vector |  |  |
| MOVD mmreg, mreg32 | 0Fh | 6Eh | 11-xxx-xxx | short | store, mload |  |
| MOVD mmreg, mem32 | 0Fh | 6Eh | mm-xxx-xxx | short | mload |  |
| MOVD mreg32, mmreg | 0Fh | 7Eh | 11-xxx-xxx | short | mstore, load |  |
| MOVD mem32, mmreg | 0Fh | 7Eh | mm-xxx-xxx | short | mstore |  |
| MOVQ mmreg1, mmreg2 | 0Fh | 6Fh | 11-xxx-xxx | short | mmx |  |
| MOVQ mmreg, mem64 | 0Fh | 6Fh | mm-xxx-xxx | short | mload, mload |  |
| MOVQ mmreg1, mmreg2 | 0Fh | 7Fh | 11-xxx-xxx | short | mmx |  |
| MOVQ mem64, mmreg | 0Fh | 7Fh | mm-xxx-xxx | short | mload, mstore |  |
| PACKSSDW mmreg1, mmreg2 | 0Fh | 6Bh | 11-xxx-xxx | short | mmx |  |
| PACKSSDW mmreg, mem64 | 0Fh | 6Bh | mm-xxx-xxx | short | mload, mmx |  |
| PACKSSWB mmreg1, mmreg2 | 0Fh | 63h | 11-xxx-xxx | short | mmx |  |
| PACKSSWB mmreg, mem64 | 0Fh | 63h | mm-xxx-xxx | short | mload, mmx |  |
| PACKUSWB mmreg1, mmreg2 | 0Fh | 67h | 11-xxx-xxx | short | mmx |  |

Note:

** Bits 2, 1, and 0 of the modR/M byte select the integer register.

Table 14. MMX Instructions (continued)



| Instruction Mnemonic | Prefix Byte(s) | First Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| PACKUSWB mmreg, mem64 | 0Fh | 67h | mm-xxx-xxx | short | mload, mmx |  |
| PADDB mmreg1, mmreg2 | 0Fh | FCh | 11-xxx-xxx | short | mmx |  |
| PADDB mmreg, mem64 | 0Fh | FCh | mm-xxx-xxx | short | mload, mmx |  |
| PADDD mmreg1, mmreg2 | 0Fh | FEh | 11-xxx-xxx | short | mmx |  |
| PADDD mmreg, mem64 | 0Fh | FEh | mm-xxx-xxx | short | mload, mmx |  |
| PADDSB mmreg1, mmreg2 | 0Fh | ECh | 11-xxx-xxx | short | mmx |  |
| PADDSB mmreg, mem64 | 0Fh | ECh | mm-xxx-xxx | short | mload, mmx |  |
| PADDSW mmreg1, mmreg2 | 0Fh | EDh | 11-xxx-xxx | short | mmx |  |
| PADDSW mmreg, mem64 | 0Fh | EDh | mm-xxx-xxx | short | mload, mmx |  |
| PADDUSB mmreg1, mmreg2 | 0Fh | DCh | 11-xxx-xxx | short | mmx |  |
| PADDUSB mmreg, mem64 | 0Fh | DCh | mm-xxx-xxx | short | mload, mmx |  |
| PADDUSW mmreg1, mmreg2 | 0Fh | DDh | 11-xxx-xxx | short | mmx |  |
| PADDUSW mmreg, mem64 | 0Fh | DDh | mm-xxx-xxx | short | mload, mmx |  |
| PADDW mmreg1, mmreg2 | 0Fh | FDh | 11-xxx-xxx | short | mmx |  |
| PADDW mmreg, mem64 | 0Fh | FDh | mm-xxx-xxx | short | mload, mmx |  |
| PAND mmreg1, mmreg2 | 0Fh | DBh | 11-xxx-xxx | short | mmx |  |
| PAND mmreg, mem64 | 0Fh | DBh | mm-xxx-xxx | short | mload, mmx |  |
| PANDN mmreg1, mmreg2 | 0Fh | DFh | 11-xxx-xxx | short | mmx |  |
| PANDN mmreg, mem64 | 0Fh | DFh | mm-xxx-xxx | short | mload, mmx |  |
| PCMPEQB mmreg1, mmreg2 | 0Fh | 74h | 11-xxx-xxx | short | mmx |  |
| PCMPEQB mmreg, mem64 | 0Fh | 74h | mm-xxx-xxx | short | mload, mmx |  |
| PCMPEQD mmreg1, mmreg2 | 0Fh | 76h | 11-xxx-xxx | short | mmx |  |
| PCMPEQD mmreg, mem64 | 0Fh | 76h | mm-xxx-xxx | short | mload, mmx |  |
| PCMPEQW mmreg1, mmreg2 | 0Fh | 75h | 11-xxx-xxx | short | mmx |  |
| PCMPEQW mmreg, mem64 | 0Fh | 75h | mm-xxx-xxx | short | mload, mmx |  |
| PCMPGTB mmreg1, mmreg2 | 0Fh | 64h | 11-xxx-xxx | short | mmx |  |
| PCMPGTB mmreg, mem64 | 0Fh | 64h | mm-xxx-xxx | short | mload, mmx |  |
| PCMPGTD mmreg1, mmreg2 | 0Fh | 66h | 11-xxx-xxx | short | mmx |  |
| PCMPGTD mmreg, mem64 | 0Fh | 66h | mm-xxx-xxx | short | mload, mmx |  |
| PCMPGTW mmreg1, mmreg2 | 0Fh | 65h | 11-xxx-xxx | short | mmx |  |
| PCMPGTW mmreg, mem64 | 0Fh | 65h | mm-xxx-xxx | short | mload, mmx |  |

Note:

** Bits 2, 1, and 0 of the modR/M byte select the integer register.

Table 14. MMX Instructions (continued)



| Instruction Mnemonic | Prefix Byte(s) | First Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| PMADDWD mmreg1, mmreg2 | 0Fh | F5h | 11-xxx-xxx | short | mmx |  |
| PMADDWD mmreg, mem64 | 0Fh | F5h | mm-xxx-xxx | short | mload, mmx |  |
| PMULHW mmreg1, mmreg2 | 0Fh | E5h | 11-xxx-xxx | short | mmx |  |
| PMULHW mmreg, mem64 | 0Fh | E5h | mm-xxx-xxx | short | mload, mmx |  |
| PMULLW mmreg1, mmreg2 | 0Fh | D5h | 11-xxx-xxx | short | mmx |  |
| PMULLW mmreg, mem64 | 0Fh | D5h | mm-xxx-xxx | short | mload, mmx |  |
| POR mmreg1, mmreg2 | 0Fh | EBh | 11-xxx-xxx | short | mmx |  |
| POR mmreg, mem64 | 0Fh | EBh | mm-xxx-xxx | short | mload, mmx |  |
| PSLLW mmreg1, mmreg2 | 0Fh | F1h | 11-xxx-xxx | short | mmx |  |
| PSLLW mmreg, mem64 | 0Fh | F1h | 11-xxx-xxx | short | mload, mmx |  |
| PSLLW mmreg, imm8 | 0Fh | 71h | 11-110-xxx | short | mmx |  |
| PSLLD mmreg1, mmreg2 | 0Fh | F2h | 11-xxx-xxx | short | mmx |  |
| PSLLD mmreg, mem64 | 0Fh | F2h | 11-xxx-xxx | short | mload, mmx |  |
| PSLLD mmreg, imm8 | 0Fh | 72h | 11-110-xxx | short | mmx |  |
| PSLLQ mmreg1, mmreg2 | 0Fh | F3h | 11-xxx-xxx | short | mmx |  |
| PSLLQ mmreg, mem64 | 0Fh | F3h | 11-xxx-xxx | short | mload, mmx |  |
| PSLLQ mmreg, imm8 | 0Fh | 73h | 11-110-xxx | short | mmx |  |
| PSRAW mmreg1, mmreg2 | 0Fh | E1h | 11-xxx-xxx | short | mmx |  |
| PSRAW mmreg, mem64 | 0Fh | E1h | 11-xxx-xxx | short | mload, mmx |  |
| PSRAW mmreg, imm8 | 0Fh | 71h | 11-100-xxx | short | mmx |  |
| PSRAD mmreg1, mmreg2 | 0Fh | E2h | 11-xxx-xxx | short | mmx |  |
| PSRAD mmreg, mem64 | 0Fh | E2h | 11-xxx-xxx | short | mload, mmx |  |
| PSRAD mmreg, imm8 | 0Fh | 72h | 11-100-xxx | short | mmx |  |
| PSRAQ mmreg1, mmreg2 | 0Fh | E3h | 11-xxx-xxx | short | mmx |  |
| PSRAQ mmreg, mem64 | 0Fh | E3h | 11-xxx-xxx | short | mload, mmx |  |
| PSRAQ mmreg, imm8 | 0Fh | 73h | 11-100-xxx | short | mmx |  |
| PSRLW mmreg1, mmreg2 | 0Fh | D1h | 11-xxx-xxx | short | mmx |  |
| PSRLW mmreg, mem64 | 0Fh | D1h | 11-xxx-xxx | short | mload, mmx |  |
| PSRLW mmreg, imm8 | 0Fh | 71h | 11-010-xxx | short | mmx |  |
| PSRLD mmreg1, mmreg2 | 0Fh | D2h | 11-xxx-xxx | short | mmx |  |
| PSRLD mmreg, mem64 | 0Fh | D2h | 11-xxx-xxx | short | mload, mmx |  |

Note:

** Bits 2, 1, and 0 of the modR/M byte select the integer register.

Table 14. MMX Instructions (continued)



| Instruction Mnemonic | Prefix Byte(s) | First Byte | ModR/M Byte | Decode Type | RISC86 Opcodes | Note |
|---|---|---|---|---|---|---|
| PSRLD mmreg, imm8 | 0Fh | 72h | 11-010-xxx | short | mmx |  |
| PSRLQ mmreg1, mmreg2 | 0Fh | D3h | 11-xxx-xxx | short | mmx |  |
| PSRLQ mmreg, mem64 | 0Fh | D3h | 11-xxx-xxx | short | mload, mmx |  |
| PSRLQ mmreg, imm8 | 0Fh | 73h | 11-010-xxx | short | mmx |  |
| PSUBB mmreg1, mmreg2 | 0Fh | F8h | 11-xxx-xxx | short | mmx |  |
| PSUBB mmreg, mem64 | 0Fh | F8h | mm-xxx-xxx | short | mload, mmx |  |
| PSUBD mmreg1, mmreg2 | 0Fh | FAh | 11-xxx-xxx | short | mmx |  |
| PSUBD mmreg, mem64 | 0Fh | FAh | mm-xxx-xxx | short | mload, mmx |  |
| PSUBSB mmreg1, mmreg2 | 0Fh | E8h | 11-xxx-xxx | short | mmx |  |
| PSUBSB mmreg, mem64 | 0Fh | E8h | mm-xxx-xxx | short | mload, mmx |  |
| PSUBSW mmreg1, mmreg2 | 0Fh | E9h | 11-xxx-xxx | short | mmx |  |
| PSUBSW mmreg, mem64 | 0Fh | E9h | mm-xxx-xxx | short | mload, mmx |  |
| PSUBUSB mmreg1, mmreg2 | 0Fh | D8h | 11-xxx-xxx | short | mmx |  |
| PSUBUSB mmreg, mem64 | 0Fh | D8h | mm-xxx-xxx | short | mload, mmx |  |
| PSUBUSW mmreg1, mmreg2 | 0Fh | D9h | 11-xxx-xxx | short | mmx |  |
| PSUBUSW mmreg, mem64 | 0Fh | D9h | mm-xxx-xxx | short | mload, mmx |  |
| PSUBW mmreg1, mmreg2 | 0Fh | F9h | 11-xxx-xxx | short | mmx |  |
| PSUBW mmreg, mem64 | 0Fh | F9h | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKHBW mmreg1, mmreg2 | 0Fh | 68h | 11-xxx-xxx | short | mmx |  |
| PUNPCKHBW mmreg, mem64 | 0Fh | 68h | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKHWD mmreg1, mmreg2 | 0Fh | 69h | 11-xxx-xxx | short | mmx |  |
| PUNPCKHWD mmreg, mem64 | 0Fh | 69h | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKHDQ mmreg1, mmreg2 | 0Fh | 6Ah | 11-xxx-xxx | short | mmx |  |
| PUNPCKHDQ mmreg, mem64 | 0Fh | 6Ah | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKLBW mmreg1, mmreg2 | 0Fh | 60h | 11-xxx-xxx | short | mmx |  |
| PUNPCKLBW mmreg, mem64 | 0Fh | 60h | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKLWD mmreg1, mmreg2 | 0Fh | 61h | 11-xxx-xxx | short | mmx |  |
| PUNPCKLWD mmreg, mem64 | 0Fh | 61h | mm-xxx-xxx | short | mload, mmx |  |
| PUNPCKLDQ mmreg1, mmreg2 | 0Fh | 62h | 11-xxx-xxx | short | mmx |  |
| PUNPCKLDQ mmreg, mem64 | 0Fh | 62h | mm-xxx-xxx | short | mload, mmx |  |
| PXOR mmreg1, mmreg2 | 0Fh | EFh | 11-xxx-xxx | short | mmx |  |
| PXOR mmreg, mem64 | 0Fh | EFh | mm-xxx-xxx | short | mload, mmx |  |

Note:

** Bits 2, 1, and 0 of the modR/M byte select the integer register.
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For more information about 3D technology and instructions, see Chapter 4, 
Technology" on page 81. 



Table 15. 3D Instructions

Instruction Mnemonic          Prefix      Opcode   ModR/M       Decode   RISC86®       Note
                              Byte(s)     Byte     Byte         Type     Opcodes
FEMMS                         0Fh         0Eh                   vector
PAVGUSB mmreg1, mmreg2        0Fh, 0Fh    BFh      11-xxx-xxx   short    mmx           3
PAVGUSB mmreg, mem64          0Fh, 0Fh    BFh      mm-xxx-xxx   short    mload, mmx    3
PFADD mmreg1, mmreg2          0Fh, 0Fh    9Eh      11-xxx-xxx   short    3D
PFADD mmreg, mem64            0Fh, 0Fh    9Eh      mm-xxx-xxx   short    mload, 3D
PFSUB mmreg1, mmreg2          0Fh, 0Fh    9Ah      11-xxx-xxx   short    3D
PFSUB mmreg, mem64            0Fh, 0Fh    9Ah      mm-xxx-xxx   short    mload, 3D
PFSUBR mmreg1, mmreg2         0Fh, 0Fh    AAh      11-xxx-xxx   short    3D
PFSUBR mmreg, mem64           0Fh, 0Fh    AAh      mm-xxx-xxx   short    mload, 3D
PFACC mmreg1, mmreg2          0Fh, 0Fh    AEh      11-xxx-xxx   short    3D
PFACC mmreg, mem64            0Fh, 0Fh    AEh      mm-xxx-xxx   short    mload, 3D
PFMUL mmreg1, mmreg2          0Fh, 0Fh    B4h      11-xxx-xxx   short    3D
PFMUL mmreg, mem64            0Fh, 0Fh    B4h      mm-xxx-xxx   short    mload, 3D
PFCMPGE mmreg1, mmreg2        0Fh, 0Fh    90h      11-xxx-xxx   short    3D
PFCMPGE mmreg, mem64          0Fh, 0Fh    90h      mm-xxx-xxx   short    mload, 3D
PFCMPGT mmreg1, mmreg2        0Fh, 0Fh    A0h      11-xxx-xxx   short    3D
PFCMPGT mmreg, mem64          0Fh, 0Fh    A0h      mm-xxx-xxx   short    mload, 3D
PFCMPEQ mmreg1, mmreg2        0Fh, 0Fh    B0h      11-xxx-xxx   short    3D
PFCMPEQ mmreg, mem64          0Fh, 0Fh    B0h      mm-xxx-xxx   short    mload, 3D
PFMIN mmreg1, mmreg2          0Fh, 0Fh    94h      11-xxx-xxx   short    3D
PFMIN mmreg, mem64            0Fh, 0Fh    94h      mm-xxx-xxx   short    mload, 3D
PFMAX mmreg1, mmreg2          0Fh, 0Fh    A4h      11-xxx-xxx   short    3D
PFMAX mmreg, mem64            0Fh, 0Fh    A4h      mm-xxx-xxx   short    mload, 3D
PI2FD mmreg1, mmreg2          0Fh, 0Fh    0Dh      11-xxx-xxx   short    3D
PI2FD mmreg, mem64            0Fh, 0Fh    0Dh      mm-xxx-xxx   short    mload, 3D
PF2ID mmreg1, mmreg2          0Fh, 0Fh    1Dh      11-xxx-xxx   short    3D
PF2ID mmreg, mem64            0Fh, 0Fh    1Dh      mm-xxx-xxx   short    mload, 3D
PFRCP mmreg1, mmreg2          0Fh, 0Fh    96h      11-xxx-xxx   short    3D
PFRCP mmreg, mem64            0Fh, 0Fh    96h      mm-xxx-xxx   short    mload, 3D
PFRSQRT mmreg1, mmreg2        0Fh, 0Fh    97h      11-xxx-xxx   short    3D
PFRSQRT mmreg, mem64          0Fh, 0Fh    97h      mm-xxx-xxx   short    mload, 3D
PFRCPIT1 mmreg1, mmreg2       0Fh, 0Fh    A6h      11-xxx-xxx   short    3D
PFRCPIT1 mmreg, mem64         0Fh, 0Fh    A6h      mm-xxx-xxx   short    mload, 3D
PFRSQIT1 mmreg1, mmreg2       0Fh, 0Fh    A7h      11-xxx-xxx   short    3D
PFRSQIT1 mmreg, mem64         0Fh, 0Fh    A7h      mm-xxx-xxx   short    mload, 3D
PFRCPIT2 mmreg1, mmreg2       0Fh, 0Fh    B6h      11-xxx-xxx   short    3D
PFRCPIT2 mmreg, mem64         0Fh, 0Fh    B6h      mm-xxx-xxx   short    mload, 3D
PMULHRW mmreg1, mmreg2        0Fh, 0Fh    B7h      11-xxx-xxx   short    mmx           3
PMULHRW mmreg, mem64          0Fh, 0Fh    B7h      mm-xxx-xxx   short    mload, mmx    3
PREFETCH mem8                 0Fh         0Dh      mm-000-xxx   vector   load          1
PREFETCHW mem8                0Fh         0Dh      mm-001-xxx   vector   load          1, 2

Notes:

1. For PREFETCH and PREFETCHW, the mem8 value refers to a byte address within the 32-byte line that will be
prefetched.

2. PREFETCHW will be implemented in a future K86 processor. On the AMD-K6 3D processor, this instruction performs
in the same manner as the PREFETCH instruction.

3. The byte listed in the column titled "Opcode Byte" is actually the immediate byte placed at the end of the instruction.






4 

3D Technology 



Introduction 



3D technology is a significant innovation to the x86 architecture 
that drives today's personal computers. 3D technology is a 
group of new instructions that opens the traditional processing 
bottlenecks for floating-point-intensive and multimedia 
applications. With 3D technology, hardware and software 
applications can implement more powerful solutions to create a 
more entertaining and productive PC platform. Examples of the 
type of improvements that 3D technology enables are faster 
frame rates on high-resolution scenes, much better physical 
modeling of real-world environments, sharper and more 
detailed 3D imaging, smoother video playback, and near 
theater-quality audio. 

3D technology was defined and implemented in collaboration 
with independent software developers, including operating 
system designers, application developers, and graphics 
vendors. It is compatible with today's existing x86 software and 
requires no operating system support, thereby enabling 3D 
applications to work with all existing operating systems.

Key Functionality



The 3D technology instructions are intended to open a major 
processing bottleneck in a 3D graphics application — 
floating-point operations. Today's 3D applications are facing 
limitations due to the fact that only one floating-point 
execution unit exists in the most advanced x86 processors. The 
front end of a typical 3D graphics software pipeline performs 
object physics, geometry transformations, clipping, and 
lighting calculations. These computations are very 
floating-point intensive and often limit the features and 
functionality of a 3D application. The source of performance for 
the 3D instructions originates from the single instruction 
multiple data (SIMD) implementation. With SIMD, each 
instruction not only operates on two single-precision, 
floating-point operands, but the microarchitecture within the 
processor can execute up to two 3D instructions per clock 
through two register execution pipelines, which allows for a 
total of four floating-point operations per clock. In addition, 
because the 3D instructions use the same floating-point 
registers as the MMX technology instructions, task switching 
between MMX and 3D operations is eliminated. For more
information about MMX instructions, see Appendix A, "MMX
Multimedia Technology" on page 347.
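
As a rough worked example of the arithmetic in the preceding paragraph, the peak rate follows from two operations per instruction and two instructions per clock. This is a sketch only: the function name is illustrative, and the 300-MHz figure is simply one of the clock speeds quoted elsewhere in this book.

```python
def peak_flops(clock_hz, instructions_per_clock=2, operations_per_instruction=2):
    """Upper bound on single-precision floating-point operations per second."""
    return clock_hz * instructions_per_clock * operations_per_instruction


# 300 MHz x 2 instructions per clock x 2 operations per instruction
print(peak_flops(300_000_000))
```

The result is an upper bound; real code reaches it only when two independent 3D operations can be issued every clock.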

The 3D technology instruction set contains 21 instructions that 
support SIMD floating-point operations and includes SIMD 
integer operations, data prefetching, and faster 
MMX-to-floating-point switching. To improve MPEG decoding, 
the 3D instructions include a specific SIMD integer instruction 
created to facilitate pixel-motion compensation. Because 
media-based software typically operates on large data sets, the 
processor often needs to wait for this data to be transferred 
from main memory. The extra time involved with retrieving this 
data can be avoided by using the new 3D instruction called 
PREFETCH. This instruction can ensure that data is in the level
1 cache when it is needed. To improve the time it takes to switch 
between MMX and x87 code, the 3D instructions include the 
FEMMS (fast entry/exit multimedia state) instruction, which 
eliminates much of the overhead involved with the switch. The 
addition of 3D technology expands the capabilities of the 
AMD-K6 family of processors and enables a new generation of 
enriched user applications.

3D Feature Detection




To properly identify and use the 3D instructions, the 
application program must determine if the processor supports 
them. The CPUID instruction gives programmers the ability to 
determine the presence of 3D technology on a processor. 
Software applications must first test to see if the CPUID 
instruction is supported. For a detailed description of the 
CPUID instruction, see Appendix C, "AMD Processor 
Recognition" on page 505. 

The presence of the CPUID instruction is indicated by the ID 
bit (21) in the EFLAGS register. If this bit is writable, the 
CPUID instruction is supported. The following code sample 
shows how to test for the presence of the CPUID instruction. 



        pushfd                   ; save EFLAGS
        pop     eax              ; store EFLAGS in EAX
        mov     ebx, eax         ; save in EBX for later testing
        xor     eax, 00200000h   ; toggle bit 21
        push    eax              ; put to stack
        popfd                    ; save changed EAX to EFLAGS
        pushfd                   ; push EFLAGS to TOS
        pop     eax              ; store EFLAGS in EAX
        cmp     eax, ebx         ; see if bit 21 has changed
        jz      NO_CPUID         ; if no change, no CPUID



Once the software has identified the processor's support for 
CPUID, it must test for extended functions by executing 
extended function 8000_0000h (EAX=8000_0000h). The EAX
register returns the largest extended function input value 
defined for the CPUID instruction on the processor. If the value 
is not zero, extended functions are supported. 

The next step is for the programmer to determine if the 3D 
instructions are supported. Extended function 8000_0001h 
(EAX=8000_0001h) of the CPUID instruction provides this 
information. Extended function 8000_0001h returns the feature 
bits in the EDX register. If bit 31 in the EDX register is set to 1, 
3D instructions are supported. The following code sample 
shows how to test for 3D instruction support. 

        mov     eax, 80000001h   ; set up extended function 8000_0001h
        CPUID                    ; call the function
        test    edx, 80000000h   ; test bit 31
        jnz     YES_3D           ; 3D technology supported
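
The decision logic of the three detection steps can also be modeled away from the hardware. The sketch below assumes the register values have already been read by some other means. The function names are hypothetical, and the check that 8000_0001h does not exceed the largest extended function is an extra safeguard not spelled out in the text above.

```python
ID_FLAG = 1 << 21          # EFLAGS bit 21 (ID)
EXT_FN_BASE = 0x80000000   # CPUID extended function 8000_0000h
FEATURE_3D = 1 << 31       # EDX bit 31 returned by function 8000_0001h


def cpuid_supported(eflags_before, eflags_after_toggle):
    """True if the attempt to toggle EFLAGS bit 21 actually changed the bit."""
    return ((eflags_before ^ eflags_after_toggle) & ID_FLAG) != 0


def has_3d(max_extended_fn, ext_feature_edx):
    """Apply the extended-function checks in the order described above.

    max_extended_fn is EAX after CPUID function 8000_0000h;
    ext_feature_edx is EDX after CPUID function 8000_0001h.
    """
    if max_extended_fn == 0:
        return False                      # no extended functions
    if max_extended_fn < EXT_FN_BASE + 1:
        return False                      # assumption: 8000_0001h must be in range
    return (ext_feature_edx & FEATURE_3D) != 0
```

A caller would evaluate cpuid_supported first and only then trust the values passed to has_3d.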
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3D Register Set 



The complete multimedia units in the AMD-K6 3D processor 
combine the existing MMX instructions with the new 3D 
instructions. In addition, by merging 3D with MMX, it becomes 
possible to write x86 programs containing both integer, MMX, 
and floating-point graphics instructions with no performance 
penalty for switching between the MMX (integer) and 3D 
(floating-point) units. 

The processor implements eight 64-bit 3D/MMX registers. 
These registers are mapped onto the floating-point registers. As 
shown in Figure 46, the 3D and MMX instructions refer to these 
registers as mmreg0 to mmreg7. Mapping the new 3D/MMX
registers onto the floating-point register stack enables 
backwards compatibility for the register saving that must occur 
as a result of task switching. 



[Figure: the eight 64-bit 3D/MMX registers, mm0 through mm7 (bits 63 to 0), each with its associated tag bits]

Figure 46. 3D/MMX Registers

Aliasing the 3D/MMX registers onto the floating-point register 
stack provides a safe method to introduce 3D and MMX 
technology, because it does not require modifications to 
existing operating systems. Instead of requiring operating 
system modifications, new 3D and MMX applications are 
supported through device drivers, 3D and MMX libraries, or
Dynamic Link Library (DLL) files. For more information, see
"MMX/3D Registers" on page 31.

Current operating systems have support for floating-point 
operations and the floating-point register state. Using the 
floating-point registers for 3D and MMX code is a convenient 
way of implementing non-intrusive support for 3D and MMX 
instructions. Every time the processor executes a 3D or MMX
instruction, all the floating-point register tag bits are set to zero
(00b=valid), except for the FEMMS and EMMS instructions,
which set all tag bits to one (11b=empty).

Note: Executing the PREFETCH instruction does not change the
tag bits.

3D Data Type Details

3D technology uses a packed data format. The data is packed in
a single, 64-bit 3D/MMX register or a quadword memory
operand. For more information, see "3D Data Types" on
page 32.

Figure 47 shows the 3D floating-point data type. D0 and D1 each
hold an IEEE 32-bit single-precision, floating-point doubleword.

(32 bits x 2) Two packed, single-precision, floating-point doublewords

 63                  32 31                   0
+----------------------+----------------------+
|          D1          |          D0          |
+----------------------+----------------------+

Figure 47. 3D Data Type Details

Figure 48 on page 86 shows the IEEE 32-bit, single-precision,
floating-point format.

32-bit, single-precision, floating-point doubleword

 31  30             23 22                     0
+---+-----------------+------------------------+
| S | Biased Exponent |      Significand       |
+---+-----------------+------------------------+

Value definitions

1. X = (-1)^S * 0                                               Biased Exponent = 0
2. X = (-1)^S * 2^{(Biased Exponent - 127)} * (1.Significand)   0 < Biased Exponent < FFh
3. X = Undefined                                                Biased Exponent = FFh

X is the value of the 32-bit, single-precision, floating-point doubleword.

Figure 48. Single-Precision, Floating-Point Data Format
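
The packed layout and the three value definitions can be checked with a short script. The following is a minimal sketch, not processor code: the helper names are made up for illustration, and Python's struct module is used only to obtain the IEEE bit patterns.

```python
import struct


def pack_pair(d1, d0):
    """Pack two floats into one 64-bit value, D1 in bits 63-32, D0 in bits 31-0."""
    hi, = struct.unpack("<I", struct.pack("<f", d1))
    lo, = struct.unpack("<I", struct.pack("<f", d0))
    return (hi << 32) | lo


def split_register(reg64):
    """Split a 64-bit register value into the (D1, D0) bit patterns."""
    return (reg64 >> 32) & 0xFFFFFFFF, reg64 & 0xFFFFFFFF


def decode_single(bits):
    """Decode a 32-bit pattern using the three value definitions.

    Returns a float, or None when the biased exponent is FFh (undefined).
    """
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    if exponent == 0x00:                      # definition 1: zero
        return -0.0 if sign else 0.0
    if exponent == 0xFF:                      # definition 3: undefined
        return None
    # definition 2: hidden integer bit plus 23 fraction bits
    magnitude = (1 + significand / 2**23) * 2.0 ** (exponent - 127)
    return -magnitude if sign else magnitude
```

For example, pack_pair(1.0, -2.0) places 3F800000h in D1 and C0000000h in D0, and decode_single recovers 1.0 and -2.0 from those patterns.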
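Figure 49 shows the formats for the integer data types.

(8 bits x 8) Packed bytes

 63  56 55  48 47  40 39  32 31  24 23  16 15   8 7    0
+------+------+------+------+------+------+------+------+
|  B7  |  B6  |  B5  |  B4  |  B3  |  B2  |  B1  |  B0  |
+------+------+------+------+------+------+------+------+

(16 bits x 4) Packed words

 63         48 47         32 31         16 15          0
+-------------+-------------+-------------+-------------+
|     W3      |     W2      |     W1      |     W0      |
+-------------+-------------+-------------+-------------+

(32 bits x 2) Packed doublewords

 63                       32 31                        0
+---------------------------+---------------------------+
|            D1             |            D0             |
+---------------------------+---------------------------+

(64 bits x 1) Quadword

 63                                                    0
+-------------------------------------------------------+
|                          Q0                           |
+-------------------------------------------------------+

Figure 49. Integer Data Types

3D Instruction Formats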



The format of 3D instruction encodings is based on the 
conventional x86 modR/M instruction format and is similar to 
the format used by MMX instructions (see "MMX Instruction
Formats" on page 354). The assembly language syntax used for
the 3D instructions is as follows:

3D Mnemonic mmreg1, mmreg2/mem64

The destination and source1 operand (mmreg1) must be an
MMX register (mm0-mm7). The source2 operand
(mmreg2/mem64) can be either an MMX register or a 64-bit
memory value.

The encoding uses the opcode prefix 0Fh followed by a second
opcode byte of 0Fh. To differentiate the various 3D
instructions, a third instruction suffix byte is used. This suffix
byte occupies the same position at the end of a 3D instruction
as would an imm8 byte. The opcode format is as follows:

0Fh 0Fh modR/M [sib] [displacement] 3D_suffix

The specific operands (mmregl and mmreg2/mem64) 
determine the values used in modR/M [sib] [displacement], and 
follow conventional x86 encodings. The 3D suffix is determined 
by the actual 3D instruction. The 3D suffixes are defined in 
Table 17 on page 92. 

As an example, the 3D PFMUL instruction can produce the 
following opcodes, depending on its use: 

Opcode               Instruction

0F 0F CA B4          PFMUL mm1, mm2
0F 0F 0B B4          PFMUL mm1, [ebx]
0F 0F 4B 0A B4       PFMUL mm1, [ebx+10]
26 0F 0F 0B B4       PFMUL mm1, es:[ebx]
0F 0F 4C 83 0A B4    PFMUL mm1, [ebx+eax*4+10]
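
The first three encodings in the list above can be reproduced with a few lines of code. This is a sketch under stated assumptions: 32-bit addressing, the standard x86 register number 3 for EBX, and helper names invented for illustration. Segment overrides and SIB forms are deliberately left out.

```python
PFMUL_SUFFIX = 0xB4   # 3D suffix byte for PFMUL
EBX = 3               # standard x86 register number for EBX


def modrm(mod, reg, rm):
    """Build a modR/M byte from its 2-bit mod field and 3-bit reg and r/m fields."""
    return (mod << 6) | (reg << 3) | rm


def encode_3d_reg_reg(suffix, dest_mm, src_mm):
    """0Fh 0Fh modR/M suffix, register-to-register form (mod = 11b)."""
    return bytes([0x0F, 0x0F, modrm(0b11, dest_mm, src_mm), suffix])


def encode_3d_reg_mem(suffix, dest_mm, base_reg, disp8=None):
    """0Fh 0Fh modR/M [disp8] suffix for a [base] or [base+disp8] operand.

    Base registers that need a SIB byte or a special case (ESP = 4, and
    EBP = 5 without a displacement) are not covered by this sketch.
    """
    if base_reg == 4 or (base_reg == 5 and disp8 is None):
        raise ValueError("operand form not covered by this sketch")
    if disp8 is None:
        return bytes([0x0F, 0x0F, modrm(0b00, dest_mm, base_reg), suffix])
    return bytes([0x0F, 0x0F, modrm(0b01, dest_mm, base_reg),
                  disp8 & 0xFF, suffix])


print(encode_3d_reg_reg(PFMUL_SUFFIX, 1, 2).hex(" "))
print(encode_3d_reg_mem(PFMUL_SUFFIX, 1, EBX).hex(" "))
print(encode_3d_reg_mem(PFMUL_SUFFIX, 1, EBX, 10).hex(" "))
```

Note how the suffix byte always comes last, after any displacement, which is why it sits where an imm8 byte would.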

The encoding of the two performance-enhancement 
instructions (FEMMS and PREFETCH) uses a single opcode 
prefix OFh. The details of the opcodes for these instructions are 
shown on pages 96 and 134 respectively.

3D Definitions



3D technology provides 21 additional instructions to support
high-performance 3D graphics and audio processing. 3D
instructions are vector instructions that operate on 64-bit
registers. 3D instructions are SIMD, operating on eight 8-bit
operands, four 16-bit operands, or two 32-bit operands.

The definitions for the 3D instructions starting on page 95
contain designations for the classification of the instruction as
vectored or scalar. Vector instructions operate in parallel on
two sets of 32-bit, single-precision, floating-point words. 
Instructions that are labeled as scalar instructions operate on a 
single set of 32-bit operands (from the low halves of the two 
64-bit operands). 

The 3D single-precision, floating-point format is compatible 
with the IEEE-754, single-precision format. This format 
comprises a 1-bit sign, an 8-bit biased exponent, and a 23-bit 
significand with one hidden integer bit for a total of 24 bits in 
the significand. The bias of the exponent is 127, consistent with 
the IEEE single-precision standard. The significands are 
normalized to be within the range of [1,2). 

In contrast to the IEEE standard that dictates four rounding
modes, 3D technology supports one rounding mode: either
round-to-nearest or round-to-zero (truncation). The hardware
implementation of 3D technology determines the rounding
mode. The processor implements round-to-nearest mode.
Regardless of the rounding mode used, the
floating-point-to-integer and integer-to-floating-point
conversion instructions, PF2ID and PI2FD, always use the
round-to-zero (truncation) mode.

The largest-representable normal number in magnitude for this
precision in hexadecimal has an exponent of FEh and a
significand of 7FFFFFh, with a numerical value of 2^{127} (2 - 2^{-23}).
All results that overflow above the maximum-representable
positive value are saturated to either this
maximum-representable normal number or to positive infinity.
Similarly, all results that overflow below the
minimum-representable negative value are saturated to either
this minimum-representable normal number or to negative
infinity.

The implementation of 3D technology determines how
arithmetic overflow is handled: either properly signed
maximum- or minimum-representable normal numbers or
properly signed infinities. The AMD-K6 3D processor generates
properly signed maximum- or minimum-representable normal
numbers.

Infinities and NaNs are not supported as operands to 3D 
instructions. 

The smallest-representable normal number in magnitude for
this precision in hexadecimal has an exponent of 01h and a
significand of 000000h, with a numerical value of 2^{-126}.
Accordingly, all results below this minimum-representable
value in magnitude are held to zero. Table 16 shows the
exponent ranges supported by the 3D technology.



Table 16. 3D Technology Exponent Ranges

Biased Exponent     Description
FFh                 Unsupported *
00h                 Zero
00h < x < FFh       Normal
01h                 2^{(1-127)} lowest possible exponent
FEh                 2^{(254-127)} largest possible exponent

Note:

* Unsupported numbers can be used as operands. The results of
operations with unsupported numbers are undefined.
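
The overflow and underflow rules described for the AMD-K6 3D processor can be summarized in a few lines. The sketch below is illustrative only: the names are hypothetical, the input is treated as an exact intermediate result, and rounding to single precision is left out. The sign of an underflowed zero is not stated in the text, so the sketch simply returns positive zero.

```python
import struct


def bits_to_float(bits):
    """Interpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


MAX_NORMAL = bits_to_float(0x7F7FFFFF)   # exponent FEh, significand 7FFFFFh
MIN_NORMAL = bits_to_float(0x00800000)   # exponent 01h, significand 000000h


def clamp_result(x):
    """Saturate overflow to the signed maximum normal; hold underflow to zero."""
    if x > MAX_NORMAL:
        return MAX_NORMAL
    if x < -MAX_NORMAL:
        return -MAX_NORMAL
    if abs(x) < MIN_NORMAL:
        return 0.0        # assumption: sign of the zero is not modeled
    return x
```

The two constants match the values given above: MAX_NORMAL equals 2^{127} (2 - 2^{-23}) and MIN_NORMAL equals 2^{-126}.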



Like MMX instructions, 3D instructions do not generate 
exceptions nor do they set any status flags. It is the user's 
responsibility to ensure that in-range data is provided to 3D 
instructions and that all computations remain within valid 
ranges (or are held as expected). 

3D Execution Resources 

The register operations of all 3D floating-point instructions are 
executed by either the register X unit or the register Y unit. 
One operation can be issued to each register unit each clock 
cycle, for a maximum issue and execution rate of two 3D 
operations per cycle. All 3D operations have an execution 
latency of two clock cycles and are fully pipelined.

Even though 3D execution resources are not duplicated in both
register units (for example, there are not two pairs of 3D
multipliers, just one shared pair of multipliers), there are no
instruction-decode or operation-issue pairing restrictions.
When, for example, a 3D multiply operation starts execution
in a register unit, that unit grabs and uses the one shared pair of
3D multipliers. Only when actual contention occurs between
two 3D operations starting execution at the same time is one of
the operations held up for one cycle in its first execution pipe
stage while the other proceeds. The delay is never more than
one cycle.

For code optimization purposes, 3D operations are grouped into 
two categories. These categories are based on execution 
resources and are important when creating properly scheduled 
code. As long as two 3D operations that start execution 
simultaneously do not fall into the same category, both 
operations will start execution without delay. 

The first category of instructions contains the operations for the 
following 3D instructions: PFADD, PFSUB, PFSUBR, PFACC, 
PFCMPx, PFMIN, PFMAX, PI2FD, PF2ID, PFRCP, and 
PFRSQRT. 

The second category contains the operations for the following 
3D instructions: PFMUL, PFRCPIT1, PFRSQIT1, and
PFRCPIT2.

Note: 3D add and multiply operations, among other 
combinations, can execute simultaneously. 
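
The contention rule lends itself to a simple lookup when hand-scheduling code. The following sketch uses the two instruction lists given above; the category labels "add" and "multiply" and the function names are assumptions made for illustration, and PFCMPx is expanded into its three mnemonics.

```python
ADD_CATEGORY = {
    "PFADD", "PFSUB", "PFSUBR", "PFACC", "PFCMPGE", "PFCMPGT", "PFCMPEQ",
    "PFMIN", "PFMAX", "PI2FD", "PF2ID", "PFRCP", "PFRSQRT",
}
MUL_CATEGORY = {"PFMUL", "PFRCPIT1", "PFRSQIT1", "PFRCPIT2"}


def category(mnemonic):
    """Return the execution-resource category of a 3D floating-point operation."""
    if mnemonic in ADD_CATEGORY:
        return "add"
    if mnemonic in MUL_CATEGORY:
        return "multiply"
    raise ValueError("not a categorized 3D operation: " + mnemonic)


def contend(op_a, op_b):
    """True if both operations need the same shared resource when they start
    in the same cycle, so that one of them is held for one cycle."""
    return category(op_a) == category(op_b)
```

For instance, contend("PFADD", "PFMUL") is False, which matches the note that adds and multiplies can start together.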

Normally, in high-performance 3D code, all of the 3D 
instructions are properly scheduled apart from each other so as 
to avoid delays due to execution resource contentions (as well 
as taking into account dependencies and execution latencies). 
For further information regarding code optimization, see
Appendix B, "Code Optimization" on page 455, which provides
in-depth discussions of code optimization techniques for the 
AMD-K6 3D processor. 

The SIMD 3D instructions are summarized in Table 17 on 
page 92. The dedicated and shared execution resources of the 
register X unit and register Y unit are shown in Figure 50 on 
page 91. The execution resources for some MMX operations, as 
well as all 3D operations, are shared between the two register
units. For contention-checking purposes, each box represents a
category of operations that cannot start execution 
simultaneously. In addition, the MMX and 3D multiplies use the 
same hardware, while MMX and 3D adds and subtracts do not. 

The two 3D performance-enhancement instructions are 
summarized in Table 18 on page 92. The FEMMS instruction 
does not use any specific execution resource or pipeline. The 
PREFETCH instruction is operated on in the Load unit. 



[Figure: execution resources of the two register units, one box per category of operations]

Register X Execution Pipeline (Dedicated Register X Resources):
   Integer ALU
   Integer Shift
   Integer Multiply and Divide
   Integer Byte Operations
   Integer Special
   Integer Segment Register Loads
   MMX ALU: Add/Subtract, Compare
   MMX ALU: Logical, Pack, Unpack

Shared Register X and Y Resources:
   3D Add/Subtract, Compare, Integer Conversion, Reciprocal and
      Reciprocal Square Root Table Lookup
   MMX and 3D Multiply, Reciprocal and Reciprocal Square Root Iteration
   MMX Shifter

Register Y Execution Pipeline (Dedicated Register Y Resources):
   Integer ALU
   MMX ALU: Add/Subtract, Compare
   MMX ALU: Logical, Pack, Unpack

Figure 50. Register X Unit and Register Y Unit Resources

Table 17. 3D Floating-Point Instructions

Operation   Function                                                                        Opcode Suffix
PAVGUSB     Packed 8-bit Unsigned Integer Averaging                                         BFh
PFADD       Packed Floating-Point Addition                                                  9Eh
PFSUB       Packed Floating-Point Subtraction                                               9Ah
PFSUBR      Packed Floating-Point Reverse Subtraction                                       AAh
PFACC       Floating-Point Accumulate                                                       AEh
PFCMPGE     Packed Floating-Point Comparison, Greater or Equal                              90h
PFCMPGT     Packed Floating-Point Comparison, Greater                                       A0h
PFCMPEQ     Packed Floating-Point Comparison, Equal                                         B0h
PFMIN       Packed Floating-Point Minimum                                                   94h
PFMAX       Packed Floating-Point Maximum                                                   A4h
PI2FD       Packed 32-bit Integer to Floating-Point Conversion                              0Dh
PF2ID       Packed Floating-Point to 32-bit Integer                                         1Dh
PFRCP       Floating-Point Reciprocal Approximation                                         96h
PFRSQRT     Floating-Point Reciprocal Square Root Approximation                             97h
PFMUL       Packed Floating-Point Multiplication                                            B4h
PFRCPIT1    Packed Floating-Point Reciprocal First Iteration Step                           A6h
PFRSQIT1    Packed Floating-Point Reciprocal Square Root First Iteration Step               A7h
PFRCPIT2    Packed Floating-Point Reciprocal/Reciprocal Square Root Second Iteration Step   B6h
PMULHRW     Packed 16-bit Integer Multiply with rounding                                    B7h

Table 18. 3D Performance-Enhancement Instructions

Operation   Function                                               Opcode Suffix
FEMMS       Faster entry/exit of the MMX or floating-point state   0Eh
PREFETCH    Prefetch at least a 32-byte line into L1 data cache    0Dh

Table 15 on page 79 contains a complete list of 3D instruction
mnemonics.

Task Switching

With respect to task switching, treat the 3D instructions exactly
the same as MMX instructions. Operating system design must
be taken into account when writing a 3D program.

The programmer must know whether the operating system
automatically saves the current states when task switching, or if
the 3D program has to provide the code to save states.

If a task switch occurs, the Control Register (CR0) Task Switch
(TS) bit is set to 1. The processor then generates an interrupt 7
(int 7, Device Not Available) when it encounters the next
floating-point, 3D, or MMX instruction, allowing the operating
system to save the state of the 3D/MMX/FP registers.

In a multitasking operating system, if there is a task switch 
when 3D/MMX applications are running with older applications 
that do not include MMX instructions, the MMX/FP register 
state is still saved automatically through the int 7 handler. For 
more information, see "Task Switching" on page 356. 

3D Exceptions 

Table 19 contains a list of exceptions that 3D and MMX 
instructions can generate. 



Table 19. 3D and MMX Instruction Exceptions

Exception                  Real   Virtual   Protected   Description
                                  8086
Invalid opcode (6)          X        X          X       The emulate instruction bit (EM) of the control register
                                                        (CR0) is set to 1.
Device not available (7)    X        X          X       Save the floating-point or MMX state if the task switch
                                                        bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)        X        X          X       During instruction execution, the stack segment limit
                                                        was exceeded.
General protection (13)                         X       During instruction execution, the effective address of
                                                        one of the segment registers used for the operand points
                                                        to an illegal memory location.
Segment overrun (13)        X        X                  One of the instruction data operands falls outside the
                                                        address range 00000h to 0FFFFh.
Page fault (14)                      X          X       A page fault resulted from the execution of the
                                                        instruction.
Floating-point exception    X        X          X       An exception is pending due to the floating-point
pending (16)                                            execution unit.
Alignment check (17)                 X          X       An unaligned memory reference resulted from the
                                                        instruction execution, and the alignment mask bit (AM)
                                                        of the control register (CR0) is set to 1. (In Protected
                                                        Mode, CPL = 3.)

The rules for exceptions are the same for both MMX and 3D
instructions. In addition, exception detection and handling is 
identical for MMX and 3D instructions. None of the exception 
handlers need modification. 

Notes: 

1. An invalid opcode exception (interrupt 6) occurs if a 3D
instruction is executed on a processor that does not
support 3D instructions.

2. If a floating-point exception is pending and the processor
encounters a 3D instruction, FERR# is asserted and, if
CR0.NE = 1, an interrupt 16 is generated. (This is the
same for MMX instructions.)
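
The register-independent exceptions can be pictured as a short decision chain. The sketch below is a simplified model, not a statement of the processor's exact behavior: the function name is hypothetical, memory-operand faults are not modeled, and the order of the checks (invalid opcode, then device not available, then the pending floating-point exception) is an assumption based on the usual handling of MMX instructions.

```python
def classify_3d_fault(cr0_em, cr0_ts, fp_exception_pending, cr0_ne,
                      supports_3d=True):
    """Return the interrupt number raised before a 3D instruction executes,
    or None if no interrupt is generated by these checks."""
    if not supports_3d or cr0_em:
        return 6       # invalid opcode
    if cr0_ts:
        return 7       # device not available
    if fp_exception_pending and cr0_ne:
        return 16      # floating-point exception pending
    # With CR0.NE = 0 a pending exception only asserts FERR#; that signal
    # is outside the scope of this sketch.
    return None
```

A handler for interrupt 7 would save the shared 3D/MMX/FP register state and then restart the instruction.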

Prefixes 

The following prefixes can be used with 3D instructions: 

■ The segment override prefixes (2Eh/CS, 36h/SS, 3Eh/DS,
26h/ES, 64h/FS, and 65h/GS) affect 3D instructions that
contain a memory operand.

■ The address-size override prefix (67h) affects 3D
instructions that contain a memory operand.

■ The operand-size override prefix (66h) is ignored.

■ The LOCK prefix (F0h) triggers an invalid opcode exception
(interrupt 6).

■ The REP prefixes (F3h/REP/REPE/REPZ, F2h/REPNE/
REPNZ) are ignored.

For information about 3D instruction coding techniques, see
"AMD-K6 3D Processor Multimedia Coding Optimizations" on
page 482 in Appendix B.

Division and Square Root 

For information about performing division and finding square
roots with 3D instructions, see "Division" on page 497 and
"Square Root and Reciprocal Square Root" on page 498, both
in Appendix B.

3D Instruction Set 



The following 3D instruction definitions are in alphabetical 
order according to the instruction mnemonics.

FEMMS

mnemonic     opcode     description
FEMMS        0F 0Eh     Faster Enter/Exit of the MMX or floating-point state

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real   Virtual   Protected   Description
                                  8086
Invalid opcode (6)          X        X          X       The emulate MMX instruction bit (EM) of the control
                                                        register (CR0) is set to 1.
Device not available (7)    X        X          X       Save the floating-point or MMX state if the task switch
                                                        bit (TS) of the control register (CR0) is set to 1.
Floating-point exception    X        X          X       An exception is pending due to the floating-point
pending (16)                                            execution unit.



Like the EMMS instruction, the FEMMS instruction can be used to clear the MMX 
state following the execution of a block of MMX instructions. Because the MMX 
registers and tag words are shared with the floating-point unit, it is necessary to clear 
the state before executing floating-point instructions. Unlike the EMMS instruction, 
the contents of the MMX/floating-point registers are undefined after a FEMMS 
instruction is executed. Therefore, the FEMMS instruction offers a faster context 
switch at the end of an MMX routine where the values in the MMX registers are no 
longer required. FEMMS can also be used prior to executing MMX instructions where 
the preceding floating-point register values are no longer required, which facilitates 
faster context switching.

PAVGUSB

mnemonic                          opcode/suffix     description
PAVGUSB mmreg1, mmreg2/mem64      0F 0Fh / BFh      Average of unsigned packed 8-bit values

Privilege:             None
Registers Affected:    MMX
Flags Affected:        None
Exceptions Generated:

Exception                  Real   Virtual   Protected   Description
                                  8086
Invalid opcode (6)          X        X          X       The emulate instruction bit (EM) of the control register
                                                        (CR0) is set to 1.
Device not available (7)    X        X          X       Save the floating-point or MMX state if the task switch
                                                        bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                            X       During instruction execution, the stack segment limit
                                                        was exceeded.
General protection (13)                         X       During instruction execution, the effective address of
                                                        one of the segment registers used for the operand points
                                                        to an illegal memory location.
Segment overrun (13)        X        X                  One of the instruction data operands falls outside the
                                                        address range 00000h to 0FFFFh.
Page fault (14)                      X          X       A page fault resulted from the execution of the
                                                        instruction.
Floating-point exception    X        X          X       An exception is pending due to the floating-point
pending (16)                                            execution unit.
Alignment check (17)                 X          X       An unaligned memory reference resulted from the
                                                        instruction execution, and the alignment mask bit (AM)
                                                        of the control register (CR0) is set to 1. (In Protected
                                                        Mode, CPL = 3.)



The PAVGUSB instruction produces the rounded averages of the eight unsigned 8-bit
integer values in the source operand (an MMX register or a 64-bit memory location)
and the eight corresponding unsigned 8-bit integer values in the destination operand
(an MMX register). It does so by adding the source and destination byte values and
then adding 001h to the 9-bit intermediate value. The intermediate value is then
divided by 2 (shifted right one place), and the eight unsigned 8-bit results are stored
in the MMX register specified as the destination operand.

The PAVGUSB instruction can be used for pixel averaging in MPEG-2 motion 
compensation and video scaling operations. 
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Functional Illustration of the PAVGUSB Instruction 



[Figure: byte-by-byte rounded averaging of mmreg2/mem64 and mmreg1 across bits 63 through 0.
The original figure marks the results that were rounded up.]

The following list explains the functional illustration of the PAVGUSB instruction: 

■ The rounded byte average of FFh and FFh is FFh.

■ The rounded byte average of FFh and 00h is 80h.

■ The rounded byte average of 01h and FFh is also 80h.

■ The rounded byte average of 0Fh and 10h is 10h.

■ The rounded byte average of 00h and 01h is 01h.

■ The rounded byte average of 70h and 44h is 5Ah.

■ The rounded byte average of 07h and F7h is 7Fh.

■ The rounded byte average of 9Ah and A8h is A1h.

The equations for byte averaging with rounding are as follows: 

■ mmreg1[63:56] = (mmreg1[63:56] + mmreg2/mem64[63:56] + 01h)/2

■ mmreg1[55:48] = (mmreg1[55:48] + mmreg2/mem64[55:48] + 01h)/2

■ mmreg1[47:40] = (mmreg1[47:40] + mmreg2/mem64[47:40] + 01h)/2

■ mmreg1[39:32] = (mmreg1[39:32] + mmreg2/mem64[39:32] + 01h)/2

■ mmreg1[31:24] = (mmreg1[31:24] + mmreg2/mem64[31:24] + 01h)/2

■ mmreg1[23:16] = (mmreg1[23:16] + mmreg2/mem64[23:16] + 01h)/2

■ mmreg1[15:8] = (mmreg1[15:8] + mmreg2/mem64[15:8] + 01h)/2

■ mmreg1[7:0] = (mmreg1[7:0] + mmreg2/mem64[7:0] + 01h)/2
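
The byte-averaging equations can be sketched in Python. This is only an illustration under stated assumptions, not AMD's implementation: the function name `pavgusb` and the use of 8-byte `bytes` objects (most significant byte first) to stand in for MMX registers are choices made for this example.

```python
def pavgusb(dest: bytes, src: bytes) -> bytes:
    """Rounded average of eight unsigned bytes, modeled on the PAVGUSB equations."""
    if len(dest) != 8 or len(src) != 8:
        raise ValueError("both operands must hold exactly eight bytes")
    # Each sum plus 01h fits in 9 bits; shifting right by one divides it by 2.
    return bytes((d + s + 1) >> 1 for d, s in zip(dest, src))


# Byte pairs taken from the list of rounded averages (hypothetical variable names).
mmreg2 = bytes.fromhex("ffff010f0070079a")
mmreg1 = bytes.fromhex("ff00ff100144f7a8")
result = pavgusb(mmreg1, mmreg2)
print(result.hex())
```

Because Python integers do not overflow, the 9-bit intermediate value needs no special handling here; hardware keeps the extra bit internally before the shift.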
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3D Technology 



PF2ID 

mnemonic opcode/imm8 description

PF2ID mmreg1, mmreg2/mem64 0Fh 0Fh / 1Dh Converts packed floating-point operand to packed 32-bit integer

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 



Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PF2ID is a vector instruction that converts a vector register containing 
single-precision, floating-point operands to 32-bit signed integers using truncation. 
Table 20 on page 100 shows the numerical range of the PF2ID instruction. 

The PF2ID instruction performs the following operations: 

IF (mmreg2/mem64[31:0] >= 2^31)
    THEN mmreg1[31:0] = 7FFF_FFFFh
ELSEIF (mmreg2/mem64[31:0] <= -2^31)
    THEN mmreg1[31:0] = 8000_0000h
ELSE mmreg1[31:0] = int(mmreg2/mem64[31:0])
IF (mmreg2/mem64[63:32] >= 2^31)
    THEN mmreg1[63:32] = 7FFF_FFFFh
ELSEIF (mmreg2/mem64[63:32] <= -2^31)
    THEN mmreg1[63:32] = 8000_0000h
ELSE mmreg1[63:32] = int(mmreg2/mem64[63:32])
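
The saturating conversion can be modeled with a short Python sketch. The names `pf2id_lane` and `pf2id` are hypothetical, a register is represented as a `(low, high)` tuple, and results are shown as signed integers, so 8000_0000h appears as -2147483648. Unsupported inputs, whose results the manual leaves undefined, are not modeled.

```python
import math

INT32_MAX = 0x7FFFFFFF    # 7FFF_FFFFh
INT32_MIN = -0x80000000   # 8000_0000h read as a signed 32-bit value


def pf2id_lane(x: float) -> int:
    """Convert one lane to a signed 32-bit integer by truncation, with saturation."""
    if x >= 2.0 ** 31:
        return INT32_MAX
    if x <= -(2.0 ** 31):
        return INT32_MIN
    return math.trunc(x)  # truncation rounds toward zero


def pf2id(src):
    """Apply the conversion to both lanes of a (low, high) pair."""
    return tuple(pf2id_lane(x) for x in src)


print(pf2id((-1.9, 3.0e9)))
```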
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Table 20. Numerical Range for the PF2ID Instruction 



Source 2                                 Source 1 and Destination
0                                        0
Normal, abs(Source 1) < 1                0
Normal, -2147483648 < Source 1 <= -1     round to zero (Source 1)
Normal, 1 <= Source 1 < 2147483648       round to zero (Source 1)
Normal, Source 1 >= 2147483648           7FFF_FFFFh
Normal, Source 1 <= -2147483648          8000_0000h
Unsupported                              Undefined



Related Instructions See the PI2FD instruction.

PFACC

mnemonic opcode/imm8 description

PFACC mmreg1, mmreg2/mem64 0Fh 0Fh / AEh Floating-point accumulate

Privilege: none

Registers Affected: MMX

Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFACC is a vector instruction that accumulates the two words of the destination
operand and the two words of the source operand, and stores the results in the low and
high words of the destination operand, respectively. Both operands are single-precision,
floating-point operands with 24-bit significands. Table 21 on page 102 shows the
numerical range of the PFACC instruction.



The PFACC instruction performs the following operations: 

mmreg1[31:0] = mmreg1[31:0] + mmreg1[63:32]
mmreg1[63:32] = mmreg2/mem64[31:0] + mmreg2/mem64[63:32]
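
A minimal Python sketch of the accumulate operation follows. The names `f32` and `pfacc` are hypothetical, registers are modeled as `(low, high)` tuples, and `struct` is used only to round each sum to single precision. The clamping rules described in the notes to Table 21 are not modeled.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pfacc(dest, src):
    """Sum the two lanes of each operand: dest pair -> low result, src pair -> high result."""
    low = f32(dest[0] + dest[1])
    high = f32(src[0] + src[1])
    return (low, high)


print(pfacc((1.0, 2.0), (3.0, 4.5)))
```

Each call reduces four values to two, so a second PFACC on its own result pair leaves the sum of all four in the low lane.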
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Table 21. Numerical Range for the PFACC Instruction 





                              Source 2
Source 1 and Destination      0              Normal               Unsupported
0                             +/- 0 *        Source 2             Source 2
Normal                        Source 1       Normal, +/- 0 **     Undefined
Unsupported                   Source 1       Undefined            Undefined

Notes:

* The sign of the result is the logical AND of the signs of the source operands.

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand
that is larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result
is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand
that is larger in magnitude.

PFADD

mnemonic opcode/imm8 description

PFADD mmreg1, mmreg2/mem64 0Fh 0Fh / 9Eh Packed floating-point addition

Privilege: none

Registers Affected: MMX

Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFADD is a vector instruction that performs addition of the destination operand and 
the source operand. Both operands are single-precision, floating-point operands with 
24-bit significands. Table 22 on page 104 shows the numerical range of the PFADD 
instruction. 



The PFADD instruction performs the following operations: 

mmreg1[31:0] = mmreg1[31:0] + mmreg2/mem64[31:0]
mmreg1[63:32] = mmreg1[63:32] + mmreg2/mem64[63:32]

Table 22. Numerical Range for the PFADD Instruction





                              Source 2
Source 1 and Destination      0              Normal               Unsupported
0                             +/- 0 *        Source 2             Source 2
Normal                        Source 1       Normal, +/- 0 **     Undefined
Unsupported                   Source 1       Undefined            Undefined

Notes:

* The sign of the result is the logical AND of the signs of the source operands.

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand
that is larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result
is greater than or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand
that is larger in magnitude.
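
The zero and clamping rules in the notes to Table 22 can be sketched in Python. This is an illustration only: the names `pfadd_lane` and `pfadd` are hypothetical, lanes are ordinary Python floats, and rounding of in-range sums to single precision is deliberately left out. The thresholds 2^-126 and 2^128 are the single-precision normal-range limits.

```python
import math

MIN_NORMAL = 2.0 ** -126                        # smallest normal single-precision magnitude
MAX_NORMAL = (2.0 - 2.0 ** -23) * 2.0 ** 127    # largest normal single-precision magnitude


def pfadd_lane(a: float, b: float) -> float:
    """Add one lane (a is source 1), applying the zero and clamping rules."""
    if a == 0.0 and b == 0.0:
        # Note *: the sign is the AND of the operand signs, so the
        # result is -0 only when both zeros are negative.
        both_negative = math.copysign(1.0, a) < 0 and math.copysign(1.0, b) < 0
        return -0.0 if both_negative else 0.0
    total = a + b
    # Note **: the sign comes from the larger-magnitude operand (source 1 on a tie).
    larger = a if abs(a) >= abs(b) else b
    if abs(total) < MIN_NORMAL:
        return math.copysign(0.0, larger)
    if abs(total) >= 2.0 ** 128:
        return math.copysign(MAX_NORMAL, larger)
    return total


def pfadd(dest, src):
    """Apply the lane rule to both halves of a (low, high) pair."""
    return tuple(pfadd_lane(d, s) for d, s in zip(dest, src))


print(pfadd((1.5, 3.0e38), (2.25, 3.0e38)))
```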
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mnemonic opcod^immB description 

PFCMPEQ mmreg1, mmreg2/mem64 0Fh 0Fh / B0h Packed floating-point comparison, equal to

Privilege: none
Registers Affected: MMX
Flags Affected: none

Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFCMPEQ is a vector instruction that performs a comparison of the destination 
operand and the source operand and generates all one bits or all zero bits based on the 
result of the corresponding comparison. Table 23 on page 106 shows the numerical 
range of the PFCMPEQ instruction. 

The PFCMPEQ instruction performs the following operations: 

IF (mmreg1[31:0] = mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] = mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h
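
As a rough sketch, the equality comparison produces one 32-bit mask per lane. The name `pfcmpeq` and the `(low, high)` tuple representation are assumptions made for this example. Python's `==` already treats positive and negative zero as equal, which matches the behavior the manual describes for zeros.

```python
ALL_ONES = 0xFFFFFFFF   # FFFF_FFFFh
ALL_ZEROS = 0x00000000  # 0000_0000h


def pfcmpeq(dest, src):
    """Return an all-ones mask for each lane where dest equals src, else all zeros."""
    return tuple(ALL_ONES if d == s else ALL_ZEROS for d, s in zip(dest, src))


print([hex(mask) for mask in pfcmpeq((1.0, 0.0), (1.5, -0.0))])
```

Masks of this form are typically combined with logical instructions to select between two sets of values without branching.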
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Table 23. Numerical Range for the PFCMPEQ Instruction 





                              Source 2
Source 1 and Destination      0                Normal                         Unsupported
0                             FFFF_FFFFh *     0000_0000h                     0000_0000h
Normal                        0000_0000h       0000_0000h, FFFF_FFFFh **      0000_0000h
Unsupported                   0000_0000h       0000_0000h                     Undefined

Notes:

* Positive zero is equal to negative zero.

** The result is FFFF_FFFFh if source 1 and source 2 have identical signs, exponents, and mantissas. Otherwise,
the result is 0000_0000h.



Related Instructions See the PFCMPGE instruction.

See the PFCMPGT instruction.



PFCMPGE

mnemonic opcode/imm8 description

PFCMPGE mmreg1, mmreg2/mem64 0Fh 0Fh / 90h Packed floating-point comparison, greater than or equal to

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFCMPGE is a vector instruction that performs a comparison of the destination 
operand and the source operand and generates all one bits or all zero bits based on the 
result of the corresponding comparison. Table 24 on page 108 shows the numerical 
range of the PFCMPGE instruction. 

The PFCMPGE instruction performs the following operations: 

IF (mmreg1[31:0] >= mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] >= mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h

Table 24. Numerical Range for the PFCMPGE Instruction







                              Source 2
Source 1 and Destination      0                               Normal                            Unsupported
0                             FFFF_FFFFh *                    0000_0000h, FFFF_FFFFh **         Undefined
Normal                        0000_0000h, FFFF_FFFFh ***      0000_0000h, FFFF_FFFFh ****       Undefined
Unsupported                   Undefined                       Undefined                         Undefined

Notes:

* Positive zero is equal to negative zero.

** The result is FFFF_FFFFh if source 2 is negative. Otherwise, the result is 0000_0000h.

*** The result is FFFF_FFFFh if source 1 is positive. Otherwise, the result is 0000_0000h.

**** The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they are both negative and source 1 is smaller
than or equal in magnitude to source 2, or if source 1 and source 2 are both positive and source 1 is greater than or equal
in magnitude to source 2. The result is 0000_0000h in all other cases.



Related Instructions See the PFCMPEQ instruction.

See the PFCMPGT instruction.





PFCMPGT

mnemonic opcode/imm8 description

PFCMPGT mmreg1, mmreg2/mem64 0Fh 0Fh / A0h Packed floating-point comparison, greater than

Privilege: none

Registers Affected: MMX

Flags Affected: none

Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFCMPGT is a vector instruction that performs a comparison of the destination 
operand and the source operand and generates all one bits or all zero bits based on the 
result of the corresponding comparison. Table 25 on page 110 shows the numerical 
range of the PFCMPGT instruction. 

The PFCMPGT instruction performs the following operations: 

IF (mmreg1[31:0] > mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] > mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h
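
The greater-than comparison, and the closely related PFCMPGE comparison, can be sketched with one shared helper. The names `packed_compare`, `pfcmpgt`, and `pfcmpge` are hypothetical, and registers are modeled as `(low, high)` tuples of Python floats. Unsupported operands, whose results are undefined, are not modeled.

```python
import operator

ALL_ONES = 0xFFFFFFFF  # FFFF_FFFFh


def packed_compare(dest, src, predicate):
    """Build one 32-bit mask per lane from a two-argument comparison predicate."""
    return tuple(ALL_ONES if predicate(d, s) else 0 for d, s in zip(dest, src))


def pfcmpgt(dest, src):
    """Mask lanes where dest is strictly greater than src."""
    return packed_compare(dest, src, operator.gt)


def pfcmpge(dest, src):
    """Mask lanes where dest is greater than or equal to src."""
    return packed_compare(dest, src, operator.ge)


print([hex(m) for m in pfcmpgt((1.0, 2.0), (1.0, -2.0))])
print([hex(m) for m in pfcmpge((1.0, 2.0), (1.0, -2.0))])
```

The two instructions differ only when a lane holds equal values, including the case of positive and negative zero.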
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Table 25. Numerical Range for the PFCMPGT Instruction 







                              Source 2
Source 1 and Destination      0                               Normal                            Unsupported
0                             0000_0000h                      0000_0000h, FFFF_FFFFh *          Undefined
Normal                        0000_0000h, FFFF_FFFFh **       0000_0000h, FFFF_FFFFh ***        Undefined
Unsupported                   Undefined                       Undefined                         Undefined

Notes:

* The result is FFFF_FFFFh if source 2 is negative. Otherwise, the result is 0000_0000h.

** The result is FFFF_FFFFh if source 1 is positive. Otherwise, the result is 0000_0000h.

*** The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they are both negative and source 1 is smaller
in magnitude than source 2, or if source 1 and source 2 are positive and source 1 is greater in magnitude than source 2.
The result is 0000_0000h in all other cases.



Related Instructions See the PFCMPEQ instruction.

See the PFCMPGE instruction.

PFMAX

mnemonic opcode/imm8 description

PFMAX mmreg1, mmreg2/mem64 0Fh 0Fh / A4h Packed floating-point maximum

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFMAX is a vector instruction that returns the larger of the two single-precision, 
floating-point operands. Any operation with a zero and a negative number returns 
positive zero. An operation consisting of two zeros returns positive zero. Table 26 on 
page 112 shows the numerical range of the PFMAX instruction. 

The PFMAX instruction performs the following operations: 

IF (mmreg1[31:0] > mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = mmreg1[31:0]
    ELSE mmreg1[31:0] = mmreg2/mem64[31:0]
IF (mmreg1[63:32] > mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = mmreg1[63:32]
    ELSE mmreg1[63:32] = mmreg2/mem64[63:32]

Table 26. Numerical Range for the PFMAX Instruction





                              Source 2
Source 1 and Destination      0                   Normal                     Unsupported
0                             +0                  Source 2, +0 *             Undefined
Normal                        Source 1, +0 **     Source 1/Source 2 ***      Undefined
Unsupported                   Undefined           Undefined                  Undefined

Notes:

* The result is source 2 if source 2 is positive. Otherwise, the result is positive zero.

** The result is source 1 if source 1 is positive. Otherwise, the result is positive zero.

*** The result is source 1 if source 1 is positive and source 2 is negative. The result is source 1 if both are positive and source 1
is greater in magnitude than source 2. The result is source 1 if both are negative and source 1 is lesser in magnitude than
source 2. The result is source 2 in all other cases.
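
The zero-handling rules in Table 26 can be illustrated with a small Python sketch. The names `pfmax_lane` and `pfmax` are hypothetical, and lanes are plain Python floats in a `(low, high)` tuple. Unsupported operands are not modeled.

```python
def pfmax_lane(a: float, b: float) -> float:
    """Return the larger lane value; any zero result is forced to positive zero."""
    result = a if a > b else b
    # A zero paired with a negative number, or two zeros of any sign, yields +0.
    return 0.0 if result == 0.0 else result


def pfmax(dest, src):
    """Apply the lane rule to both halves of a (low, high) pair."""
    return tuple(pfmax_lane(d, s) for d, s in zip(dest, src))


print(pfmax((1.0, -0.0), (3.0, -4.0)))
```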



Related Instructions See the PFMIN instruction. 
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PFMIN 

mnemonic opcode/imm8 description

PFMIN mmreg1, mmreg2/mem64 0Fh 0Fh / 94h Packed floating-point minimum

Privilege: none

Registers Affected: MMX

Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)

PFMIN is a vector instruction that returns the smaller of the two single-precision,
floating-point operands. Any operation with a zero and a positive number returns
positive zero. An operation consisting of two zeros returns positive zero. Table 27 on
page 114 shows the numerical range of the PFMIN instruction.

The PFMIN instruction performs the following operations:

IF (mmreg1[31:0] < mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = mmreg1[31:0]
    ELSE mmreg1[31:0] = mmreg2/mem64[31:0]
IF (mmreg1[63:32] < mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = mmreg1[63:32]
    ELSE mmreg1[63:32] = mmreg2/mem64[63:32]
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Table 27. Numerical Range for the PFMIN Instruction 





                              Source 2
Source 1 and Destination      0                   Normal                     Unsupported
0                             +0                  Source 2, +0 *             Undefined
Normal                        Source 1, +0 **     Source 1/Source 2 ***      Undefined
Unsupported                   Undefined           Undefined                  Undefined

Notes:

* The result is source 2 if source 2 is negative. Otherwise, the result is positive zero.

** The result is source 1 if source 1 is negative. Otherwise, the result is positive zero.

*** The result is source 1 if source 1 is negative and source 2 is positive. The result is source 1 if both are negative and source
1 is greater in magnitude than source 2. The result is source 1 if both are positive and source 1 is lesser in magnitude than
source 2. The result is source 2 in all other cases.



Related Instructions See the PFMAX instruction.

PFMUL

mnemonic opcode/imm8 description

PFMUL mmreg1, mmreg2/mem64 0Fh 0Fh / B4h Packed floating-point multiplication

Privilege: none
Registers Affected: MMX
Flags Affected: none

Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFMUL is a vector instruction that performs multiplication of the destination
operand and the source operand. Both operands are single-precision, floating-point
operands with 24-bit significands. Table 28 on page 116 shows the numerical range of
the PFMUL instruction.



The PFMUL instruction performs the following operations: 

mmreg1[31:0] = mmreg1[31:0] * mmreg2/mem64[31:0]
mmreg1[63:32] = mmreg1[63:32] * mmreg2/mem64[63:32]
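
A lane-by-lane sketch of the multiplication, including the sign and clamping rules given in the notes to Table 28, might look as follows. The names `pfmul_lane` and `pfmul` are hypothetical, lanes are ordinary Python floats, and rounding of in-range products to single precision is not modeled. The thresholds 2^-126 and 2^128 are the single-precision normal-range limits.

```python
import math

MIN_NORMAL = 2.0 ** -126                        # smallest normal single-precision magnitude
MAX_NORMAL = (2.0 - 2.0 ** -23) * 2.0 ** 127    # largest normal single-precision magnitude


def pfmul_lane(a: float, b: float) -> float:
    """Multiply one lane; the result sign is the exclusive-OR of the operand signs."""
    negative = (math.copysign(1.0, a) < 0) != (math.copysign(1.0, b) < 0)
    sign = -1.0 if negative else 1.0
    magnitude = abs(a) * abs(b)
    if magnitude < MIN_NORMAL:
        return math.copysign(0.0, sign)          # too small: signed zero
    if magnitude >= 2.0 ** 128:
        return math.copysign(MAX_NORMAL, sign)   # too large: largest normal number
    return math.copysign(magnitude, sign)


def pfmul(dest, src):
    """Apply the lane rule to both halves of a (low, high) pair."""
    return tuple(pfmul_lane(d, s) for d, s in zip(dest, src))


print(pfmul((-2.0, 1.0e30), (3.0, -1.0e30)))
```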
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Table 28. Numerical Range for the PFMUL Instruction 





                              Source 2
Source 1 and Destination      0              Normal               Unsupported
0                             +/- 0 *        +/- 0 *              +/- 0 *
Normal                        +/- 0 *        Normal, +/- 0 **     Undefined
Unsupported                   +/- 0 *        Undefined            Undefined

Notes:

* The sign of the result is the exclusive-OR of the signs of the source operands.

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the exclusive-OR of the signs of the
source operands. If the absolute value of the product is greater than or equal to 2^128, the result is the largest normal number
with the sign being the exclusive-OR of the signs of the source operands.

PFRCP

mnemonic opcode/imm8 description

PFRCP mmreg1, mmreg2/mem64 0Fh 0Fh / 96h Floating-point reciprocal approximation

Privilege: none

Registers Affected: MMX

Flags Affected: none

Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFRCP is a scalar instruction that returns a low-precision estimate of the reciprocal of 
the source operand. The single result value is duplicated in both high and low halves 
of this instruction's 64-bit result. The source operand is single-precision with a 24-bit 
significand, and the result is accurate to 14 bits. Table 29 on page 118 shows the 
numerical range of the PFRCP instruction. 

Increased accuracy (the full 24 bits of a single-precision significand) requires the use
of two additional instructions (PFRCPIT1 and PFRCPIT2). The first stage of this
increase or refinement in accuracy (PFRCPIT1) requires that the input and output of
the already executed PFRCP instruction be used as input to the PFRCPIT1
instruction. Refer to "3D Instruction Coding" on page 95 for an application-specific
example of how to use this instruction and related instructions.

The PFRCP instruction performs the following operations: 

mmreg1[31:0] = reciprocal(mmreg2/mem64[31:0])
mmreg1[63:32] = reciprocal(mmreg2/mem64[31:0])



In the following code example, the first line illustrates the PFRCP instruction in a
sequence used to compute q = a/b accurate to 24 bits:

X0 = PFRCP(b)
X1 = PFRCPIT1(b, X0)
X2 = PFRCPIT2(X1, X0)
q = PFMUL(a, X2)
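
The idea behind this sequence, a coarse reciprocal estimate refined by Newton-Raphson iteration, can be sketched in Python. This is a generic illustration and not the hardware algorithm: `coarse_reciprocal` is a hypothetical stand-in for PFRCP that keeps about 14 significant bits, `newton_step` applies the textbook update x' = x(2 - bx), and how the processor divides that work between PFRCPIT1 and PFRCPIT2 is not modeled.

```python
import math


def coarse_reciprocal(b: float, bits: int = 14) -> float:
    """Hypothetical stand-in for PFRCP: 1/b truncated to about `bits` significant bits."""
    mantissa, exponent = math.frexp(1.0 / b)   # 1/b == mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def newton_step(b: float, x: float) -> float:
    """One Newton-Raphson refinement of x as an estimate of 1/b."""
    return x * (2.0 - b * x)


def divide(a: float, b: float) -> float:
    """Compute a/b as a * (refined reciprocal of b)."""
    x0 = coarse_reciprocal(b)
    x1 = newton_step(b, x0)   # relative error is roughly squared by the step
    return a * x1


print(divide(1.0, 3.0))
```

A single refinement step roughly doubles the number of correct bits, which is why an estimate good to about 14 bits is enough to reach single-precision accuracy.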



Table 29. Numerical Range for the PFRCP Instruction 





Source 2        Source 1 and Destination
0               +/- Maximum Normal *
Normal          Normal, +/- 0 **
Unsupported     Undefined

Notes:

* The result has the same sign as the source operand.

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand.
Otherwise, the result is a normal with the sign being the same sign as the source operand.



Related Instructions See the PFRCPIT1 instruction.

See the PFRCPIT2 instruction.

PFRCPIT1

mnemonic opcode/imm8 description

PFRCPIT1 mmreg1, mmreg2/mem64 0Fh 0Fh / A6h Packed floating-point reciprocal, first iteration step

Privilege: none

Registers Affected: MMX

Flags Affected: none
Exceptions Generated:

Exception                               Real  Virtual 8086  Protected  Description
Invalid opcode (6)                      X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)                X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                    -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                 -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                    X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                         -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)   X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                    -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFRCPIT1 is a vector instruction that performs the first step in a Newton-Raphson 
iteration to refine the reciprocal approximation produced by the PFRCP instruction 
(the second and final step yields a result accurate to 24 bits). Table 30 on page 120 
shows the numerical range of the PFRCPIT1 instruction. 

The behavior of this instruction is only defined for those combinations of operands 
such that one source operand was the input to the PFRCP instruction and the other 
source operand was the output of the same PFRCP instruction. Refer to "3D 
Instruction Coding" on page 95 for an application-specific example of how to use this 
instruction and related instructions. 

In the following code example, the bold line illustrates the PFRCPIT1 instruction in a 
sequence used to compute q = a/b accurate to 24 bits: 

    X0 = PFRCP(b)
    X1 = PFRCPIT1(b, X0)
    X2 = PFRCPIT2(X1, X0)
    q  = PFMUL(a, X2)
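
The arithmetic behind this sequence can be sketched in a few lines of Python. This is only an illustration of one Newton-Raphson step for a reciprocal, x1 = x0 * (2 - b * x0); it does not model how the processor divides the work between PFRCPIT1 and PFRCPIT2, and the function names and the starting estimate are assumptions chosen for the example.

```python
def newton_reciprocal_step(b, x0):
    """One Newton-Raphson step toward 1/b, starting from the estimate x0."""
    return x0 * (2.0 - b * x0)


def divide(a, b, x0):
    """Approximate a / b from a rough estimate x0 of 1/b.

    x0 plays the role of the low-precision PFRCP result (hypothetical value
    supplied by the caller); the final multiply corresponds to PFMUL.
    """
    x1 = newton_reciprocal_step(b, x0)
    return a * x1


# 0.33331 approximates 1/3 to roughly 14 bits; one step roughly doubles
# the number of correct bits.
q = divide(10.0, 3.0, 0.33331)
```

Because each step roughly squares the relative error, an estimate good to about 14 bits is enough for a single iteration to exceed 24 bits, which is consistent with the accuracy figures quoted for PFRCP and its two iteration steps.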

Table 30. Numerical Range for the PFRCPIT1 Instruction 

                                        Source 2
Source 1 and Destination    0         Normal       Unsupported
0                           +/-0 *    +/-0 *       +/-0 *
Normal                      +/-0 *    Normal **    Undefined
Unsupported                 +/-0 *    Undefined    Undefined

Notes: 

* The sign of the result is the exclusive-OR of the signs of the source operands. 
** The sign is positive. 

Related Instructions 

See the PFRCP instruction. 
See the PFRCPIT2 instruction. 



PFRCPIT2 

mnemonic                         opcode/imm8      description
PFRCPIT2 mmreg1, mmreg2/mem64    0Fh 0Fh / B6h    Packed floating-point reciprocal/reciprocal square root, second iteration step

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFRCPIT2 is a vector instruction that performs the second and final step in a 
Newton-Raphson iteration to refine the reciprocal or reciprocal square root 
approximation produced by the PFRCP and PFRSQRT instructions, respectively. 
Table 31 on page 122 shows the numerical range of the PFRCPIT2 instruction. 

The behavior of this instruction is only defined for those combinations of operands 
such that the first source operand (mmreg1) was the output of either the PFRCPIT1 or 
PFRSQIT1 instructions and the second source operand (mmreg2/mem64) was the 
output of either the PFRCP or PFRSQRT instructions. Refer to "3D Instruction 
Coding" on page 95 for an application-specific example of how to use this instruction 
and related instructions. 
In the following code example, the bold line illustrates the PFRCPIT2 instruction in a 
sequence used to compute q = a/b accurate to 24 bits: 

    X0 = PFRCP(b)
    X1 = PFRCPIT1(b, X0)
    X2 = PFRCPIT2(X1, X0)
    q  = PFMUL(a, X2)



Table 31. Numerical Range for the PFRCPIT2 Instruction 

                                        Source 2
Source 1 and Destination    0         Normal       Unsupported
0                           +/-0 *    +/-0 *       +/-0 *
Normal                      +/-0 *    Normal       Undefined
Unsupported                 +/-0 *    Undefined    Undefined

Notes: 

* The sign of the result is the exclusive-OR of the signs of the source operands. 

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the exclusive-OR of the signs of the 
source operands. If the absolute value of the product is greater than or equal to 2^128, the result is the largest normal number 
with the sign being the exclusive-OR of the signs of the source operands. 



Related Instructions 

See the PFRCPIT1 instruction. 
See the PFRSQIT1 instruction. 
See the PFRCP instruction. 
See the PFRSQRT instruction. 




PFRSQIT1 

mnemonic                         opcode/imm8      description
PFRSQIT1 mmreg1, mmreg2/mem64    0Fh 0Fh / A7h    Packed floating-point reciprocal square root, first iteration step

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFRSQIT1 is a vector instruction that performs the first step in a Newton-Raphson 
iteration to refine the reciprocal square root approximation produced by the PFRSQRT 
instruction (the second and final step yields a result accurate to 24 bits). Table 32 on page 124 
shows the numerical range of the PFRSQIT1 instruction. 

The behavior of this instruction is only defined for those combinations of operands 
such that one source operand was the input to the PFRSQRT instruction and the other 
source operand is the square of the output of the same PFRSQRT instruction. Refer to 
"3D Instruction Coding" on page 95 for an application-specific example of how to use 
this instruction and related instructions. 



In the following code example, the bold lines illustrate the PFMUL and PFRSQIT1 
instructions in a sequence used to compute a = 1/sqrt(b) accurate to 24 bits: 

    X0 = PFRSQRT(b)
    X1 = PFMUL(X0, X0)
    X2 = PFRSQIT1(b, X1)
    a  = PFRCPIT2(X2, X0)
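
As a rough numerical illustration of why the squared estimate appears in this sequence, the Python sketch below applies one Newton-Raphson step for a reciprocal square root, x1 = x0 * (3 - b * x0^2) / 2. It shows the mathematics only; the exact intermediate values exchanged between PFRSQIT1 and PFRCPIT2 are not modeled, and the function name and starting estimate are assumptions made for the example.

```python
def newton_rsqrt_step(b, x0):
    """One Newton-Raphson step toward 1/sqrt(b), starting from the estimate x0."""
    x0_squared = x0 * x0              # corresponds to the PFMUL(X0, X0) line
    return x0 * (3.0 - b * x0_squared) * 0.5


# 0.7071 is a hypothetical low-precision estimate of 1/sqrt(2).
a = newton_rsqrt_step(2.0, 0.7071)
```

The step needs both b and the square of the estimate, which matches the operand requirement stated above for PFRSQIT1.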



Table 32. Numerical Range for the PFRSQIT1 Instruction 

                                        Source 2
Source 1 and Destination    0         Normal       Unsupported
0                           +/-0 *    +/-0 *       +/-0 *
Normal                      +/-0 *    Normal **    Undefined
Unsupported                 +/-0 *    Undefined    Undefined

Notes: 

* The sign of the result is the exclusive-OR of the signs of the source operands. 
** The sign is 0. 

Related Instructions 

See the PFRCPIT2 instruction. 
See the PFRSQRT instruction. 



PFRSQRT 

mnemonic                         opcode/imm8      description
PFRSQRT mmreg1, mmreg2/mem64     0Fh 0Fh / 97h    Floating-point reciprocal square root approximation

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFRSQRT is a scalar instruction that returns a low-precision estimate of the 
reciprocal square root of the source operand. The single result value is duplicated in 
both high and low halves of this instruction's 64-bit result. The source operand is 
single-precision with a 24-bit significand, and the result is accurate to 15 bits. 
Negative operands are treated as positive operands for purposes of reciprocal square 
root computation, with the sign of the result the same as the sign of the source 
operand. Table 33 on page 126 shows the numerical range of the PFRSQRT 
instruction. 

Increased accuracy (the full 24 bits of a single-precision significand) requires the use 
of two additional instructions (PFRSQIT1 and PFRCPIT2). The first stage of this 
increase or refinement in accuracy (PFRSQIT1) requires that the input and squared 
output of the already executed PFRSQRT instruction be used as input to the 
PFRSQIT1 instruction. Refer to "3D Instruction Coding" on page 95 for an 
application-specific example of how to use this instruction and related instructions. 







The PFRSQRT instruction performs the following operations: 

mmreg1[31:0] = reciprocal square root(mmreg2/mem64[31:0])
mmreg1[63:32] = reciprocal square root(mmreg2/mem64[31:0])

In the following code example, the bold line illustrates the PFRSQRT instruction in a 
sequence that refines 1/sqrt(b) to 24 bits of accuracy (X3); the final PFMUL multiplies 
that value by b, so a = b/sqrt(b) = sqrt(b): 

    X0 = PFRSQRT(b)
    X1 = PFMUL(X0, X0)
    X2 = PFRSQIT1(b, X1)
    X3 = PFRCPIT2(X2, X0)
    a  = PFMUL(b, X3)



Table 33. Numerical Range for the PFRSQRT Instruction 

Source 2       Source 1 and Destination
0              +/- Maximum Normal *
Normal         Normal *
Unsupported    Undefined *

Notes: 

* The result has the same sign as the source operand. 

Related Instructions 

See the PFRSQIT1 instruction. 
See the PFRCPIT2 instruction. 






PFSUB 

mnemonic                         opcode/imm8      description
PFSUB mmreg1, mmreg2/mem64       0Fh 0Fh / 9Ah    Packed floating-point subtraction

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFSUB is a vector instruction that performs subtraction of the source operand from 
the destination operand. Both operands are single-precision, floating-point operands 
with 24-bit significands. Table 34 on page 128 shows the numerical range of the 
PFSUB instruction. 

The PFSUB instruction performs the following operations: 

mmreg1[31:0] = mmreg1[31:0] - mmreg2/mem64[31:0]
mmreg1[63:32] = mmreg1[63:32] - mmreg2/mem64[63:32]



Table 34. Numerical Range for the PFSUB Instruction 

                                          Source 2
Source 1 and Destination    0           Normal             Unsupported
0                           +/-0 *      Source 2           Source 2
Normal                      Source 1    Normal, +/-0 **    Undefined
Unsupported                 Source 1    Undefined          Undefined

Notes: 

* The sign of the result is the logical AND of the sign of source 1 and the inverse of the sign of source 2. 

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is 
larger in magnitude (if the magnitudes are equal, the sign of source 1 is used). If the absolute value of the result is greater than 
or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in 
magnitude. 

Related Instructions 

See the PFSUBR instruction. 







PFSUBR 

mnemonic                         opcode/imm8      description
PFSUBR mmreg1, mmreg2/mem64      0Fh 0Fh / AAh    Packed floating-point reverse subtraction

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PFSUBR is a vector instruction that performs subtraction of the destination operand 
from the source operand. Both operands are single-precision, floating-point operands 
with 24-bit significands. Table 35 on page 130 shows the numerical range of the 
PFSUBR instruction. 

The PFSUBR instruction performs the following operations: 

mmreg1[31:0] = mmreg2/mem64[31:0] - mmreg1[31:0]
mmreg1[63:32] = mmreg2/mem64[63:32] - mmreg1[63:32]
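
The only difference between PFSUB and PFSUBR is the operand order, which the following Python sketch makes explicit. It is a simplified model under stated assumptions: each register is represented as a (low, high) tuple, results are rounded to single precision with the standard `struct` module, and the special cases listed in the numerical range tables (unsupported operands, clamping to zero or to the largest normal) are not modeled. The function names are hypothetical.

```python
import struct


def f32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def pfsub(dst, src):
    """Lane-wise dst - src; element 0 is bits [31:0], element 1 is bits [63:32]."""
    return tuple(f32(d - s) for d, s in zip(dst, src))


def pfsubr(dst, src):
    """Lane-wise src - dst (reverse subtraction), same lane layout as pfsub."""
    return tuple(f32(s - d) for d, s in zip(dst, src))


mmreg1 = (5.0, 1.5)
mmreg2 = (2.0, 4.0)
forward = pfsub(mmreg1, mmreg2)
reverse = pfsubr(mmreg1, mmreg2)
```

Having both forms available means a difference can be formed in whichever register is free to be overwritten, without first copying an operand.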






Table 35. Numerical Range for the PFSUBR Instruction 

                                          Source 2
Source 1 and Destination    0           Normal             Unsupported
0                           +/-0 *      Source 2           Source 2
Normal                      Source 1    Normal, +/-0 **    Undefined
Unsupported                 Source 1    Undefined          Undefined

Notes: 

* The sign of the result is the logical AND of the sign of source 1 and the inverse of the sign of source 2. 

** If the absolute value of the result is less than 2^-126, the result is zero with the sign being the sign of the source operand that is 
larger in magnitude (if the magnitudes are equal, the sign of source 2 is used). If the absolute value of the result is greater than 
or equal to 2^128, the result is the largest normal number with the sign being the sign of the source operand that is larger in 
magnitude. 

Related Instructions 

See the PFSUB instruction. 
PI2FD 

mnemonic                         opcode/imm8      description
PI2FD mmreg1, mmreg2/mem64       0Fh 0Fh / 0Dh    Packed 32-bit integer to floating-point conversion

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



PI2FD is a vector instruction that converts a vector register containing signed, 32-bit 
integers to single-precision, floating-point operands. When PI2FD converts an input 
operand with more significant digits than are available in the output, the output is 
truncated. 



The PI2FD instruction performs the following operations: 

mmreg1[31:0] = float(mmreg2/mem64[31:0])
mmreg1[63:32] = float(mmreg2/mem64[63:32])
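
The truncation described above matters only for integers whose magnitude needs more than 24 significant bits. The Python sketch below models one lane under the assumption that "truncated" means the excess low-order bits of the magnitude are discarded (rounding toward zero); the function name is hypothetical and the code is an illustration, not a description of the hardware datapath.

```python
def pi2fd_lane(n):
    """Convert a signed 32-bit integer to a single-precision value, truncating.

    A single-precision significand holds 24 bits, so any bits of the
    magnitude below the top 24 are cleared before conversion.
    """
    if not -2**31 <= n < 2**31:
        raise ValueError("operand must fit in a signed 32-bit integer")
    magnitude = abs(n)
    excess_bits = max(0, magnitude.bit_length() - 24)
    magnitude = (magnitude >> excess_bits) << excess_bits
    return float(-magnitude if n < 0 else magnitude)


exact = pi2fd_lane(16777216)       # 2^24 fits in 24 significant bits
truncated = pi2fd_lane(16777217)   # 2^24 + 1 loses its lowest bit
```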

Related Instructions See the PF2ID instruction. 
PMULHRW 

mnemonic                         opcode/suffix    description
PMULHRW mmreg1, mmreg2/mem64     0Fh 0Fh / B7h    Multiply signed packed 16-bit values with rounding and store the high 16 bits.

Privilege: None 
Registers Affected: MMX 
Flags Affected: None 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                   -     -             X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                -     -             X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X             -          One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                        -     X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                   -     X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PMULHRW instruction multiplies the four signed 16-bit integer values in the 
source operand (an MMX register or a 64-bit memory location) by the four 
corresponding signed 16-bit integer values in the destination operand (an MMX 
register). The PMULHRW instruction then adds 8000h to the lower 16 bits of the 
32-bit result, which results in the rounding of the high-order, 16-bit result. The 
high-order 16 bits of the result (including the sign bit) are stored in the destination 
operand. 

The PMULHRW instruction provides a numerically more accurate result than the 
PMULHW instruction, which truncates the result instead of rounding. 
Functional Illustration of the PMULHRW Instruction 

mmreg2/mem64         D250h      5321h      7007h      FFFFh
                       *          *          *          *
mmreg1               8807h      EC22h      7FFEh      FFFFh
                       =          =          =          =
mmreg1 (result)      1569h      F98Ch    ■ 3803h      0000h

■ Indicates a value that was rounded up 



The following list explains the functional illustration of the PMULHRW instruction: 

■ The signed 16-bit negative value D250h (-2DB0h) is multiplied by the signed 
16-bit negative value 8807h (-77F9h) to produce the signed 32-bit positive result 
of 1569_4030h. 8000h is then added to the lower 16 bits to produce a final result of 
1569_C030h. This rounding does not affect the final result of 1569h. The signed 
high-order 16 bits of the result are stored in the destination operand. 

■ The signed 16-bit positive value 5321h is multiplied by the signed 16-bit negative 
value EC22h (-13DEh) to produce the signed 32-bit negative result of F98C_7662h 
(-0673_899Eh). 8000h is then added to the lower 16 bits, producing a final result 
of F98C_F662h. This rounding does not affect the final result of F98Ch. The 
signed high-order 16 bits of the result are stored in the destination operand. 

■ The signed 16-bit positive value 7007h is multiplied by the signed 16-bit positive 
value 7FFEh to produce the signed 32-bit positive result of 3802_9FF2h. 8000h is 
then added to the lower 16 bits to produce a final result of 3803_1FF2h. This result 
has been rounded up. The signed high-order 16 bits of the result (3803h) are 
stored in the destination operand. 

■ The signed 16-bit negative value FFFFh (-1) is multiplied by the signed 16-bit 
negative value FFFFh (-1) to produce the signed 32-bit positive result of 
0000_0001h. 8000h is then added to the lower 16 bits to produce a final result of 
0000_8001h. This rounding does not affect the final result of 0000h. The signed 
high-order 16 bits of the result are stored in the destination operand. 
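
The four cases above can be reproduced with a short Python model of the operation. This is a sketch written for illustration: the helper names are hypothetical, each 64-bit operand is represented as a list of four 16-bit words in the same left-to-right order as the illustration, and the code follows the description in this section (multiply, add 8000h, keep the high-order 16 bits).

```python
def to_signed16(word):
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return word - 0x10000 if word & 0x8000 else word


def pmulhrw_lane(a, b):
    """Multiply two signed 16-bit words, round, and return the high 16 bits."""
    product = to_signed16(a) * to_signed16(b)   # signed 32-bit product
    rounded = product + 0x8000                  # add 8000h before taking the high word
    return (rounded >> 16) & 0xFFFF             # arithmetic shift, then 16-bit pattern


def pmulhrw(dst, src):
    """Apply pmulhrw_lane to each pair of corresponding words."""
    return [pmulhrw_lane(d, s) for d, s in zip(dst, src)]


mmreg1 = [0x8807, 0xEC22, 0x7FFE, 0xFFFF]
mmreg2 = [0xD250, 0x5321, 0x7007, 0xFFFF]
result = pmulhrw(mmreg1, mmreg2)
```

Only the third word changes as a result of the rounding step, since it is the only product whose lower 16 bits (9FF2h) are at least 8000h.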






PREFETCH/PREFETCHW 

mnemonic            opcode     description
PREFETCH(W) mem8    0Fh 0Dh    Prefetch processor cache line into L1 data cache

Privilege: none 
Registers Affected: none 
Flags Affected: none 
Exceptions Generated: 

Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.

The PREFETCH instruction loads a processor cache line into the data cache. The 
address of this line is specified by the mem8 value. For the AMD-K6 3D processor, the 
line size is 32 bytes. In all future processors, the size of the line that is loaded by the 
PREFETCH instruction will be at least 32 bytes. The PREFETCH instruction loads a 
line even if the mem8 address is not aligned with the start of the line, although some 
implementations, including the AMD-K6 family of processors, may perform the cache 
fill starting from the cache miss or mem8 address. If a cache hit occurs (the line is 
already in the data cache) or a memory fault is detected, no bus cycle is initiated and 
the instruction is treated as a NOP. 

In applications where a large number of data sets must be processed, the PREFETCH 
instruction can pre-load the next data set into the data cache while, simultaneously, 
the processor is operating on the present set of data. This instruction allows the 
programmer to explicitly code operation concurrency. When the present set of data 
values is completed, the next set is already available in the data cache. An example of 
a concurrent operation is vertices processing in 3D transformations, where the next 
set of vertices can be prefetched into the data cache while the present set is being 
transformed. 

The PREFETCH instruction format in the processor is defined to allow extensions in 
future AMD K86 processors. The instruction mnemonic for the PREFETCH 
instruction includes the modR/M byte. Only the memory form of modR/M is valid (use 
of the register form results in an invalid opcode exception). Because there is no 
destination register, the three destination register field bits of the modR/M byte are 
used to define the type of prefetch to be performed. The PREFETCH and 
PREFETCHW instructions are defined by the bit patterns 000b and 001b, respectively. 
All other bit patterns are reserved for future use. 

The PREFETCHW instruction will load the prefetched line and set the cache line 
MESI state to modified (in anticipation of subsequent data writes to the line), unlike 
the PREFETCH instruction, which typically sets the state to exclusive. If the data 
that is prefetched into the data cache is to be modified, use of the PREFETCHW 
instruction will save the cycle that the PREFETCH instruction requires for modifying 
the data cache line state. The AMD-K6 3D processor treats a PREFETCHW 
instruction the same as a PREFETCH instruction. The PREFETCHW instruction 
should be used when the programmer expects that the data in the cache line will be 
modified. Otherwise, the PREFETCH instruction should be used. 

Table 36 summarizes the PREFETCH type options: 

Table 36. Summary of PREFETCH Instruction Type Options 

ModR/M        Result
11-xxx-xxx    Invalid Opcode
mm-000-xxx    PREFETCH
mm-001-xxx    PREFETCHW
mm-010-xxx    Reserved
mm-011-xxx    Reserved
mm-100-xxx    Reserved
mm-101-xxx    Reserved
mm-110-xxx    Reserved
mm-111-xxx    Reserved

Note: The "Reserved" PREFETCH types do not result in an Invalid Opcode Exception if 
executed. Instead, for forward compatibility with future processors that may 
implement additional forms of the PREFETCH instruction, all "Reserved" 
PREFETCH types are implemented as synonyms for the basic PREFETCH type (that is, 
the PREFETCH instruction with type 000b). 
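
The decoding rule summarized in Table 36 can be expressed as a small Python function. This is a sketch for illustration only: the function name and the returned strings are assumptions, and the code covers just the modR/M classification described in this section, not address generation or the cache access itself.

```python
def decode_prefetch(modrm):
    """Classify the modR/M byte that follows the PREFETCH opcode.

    Bits 7:6 are the mod field and bits 5:3 are the field that selects
    the prefetch type; the register form (mod = 11b) is not valid.
    """
    mod = (modrm >> 6) & 0b11
    prefetch_type = (modrm >> 3) & 0b111
    if mod == 0b11:
        return "Invalid Opcode"
    if prefetch_type == 0b001:
        return "PREFETCHW"
    # 000b is the basic PREFETCH; reserved types behave as synonyms for it.
    return "PREFETCH"


kinds = [decode_prefetch(byte) for byte in (0x00, 0x08, 0x10, 0xC0)]
```

Treating the reserved types as plain PREFETCH, rather than faulting, is what lets code written for a later processor with additional prefetch forms still run on this one.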
Signal Descriptions 



This chapter describes the signals used by the AMD-K6 3D 
processor. Figure 51 on page 138 shows the signals grouped by 
function. The arrows in the figure indicate the direction of the 
signal, either into or out of the processor. Signals with 
double-headed arrows are bidirectional. Signals with pound 
signs (#) are asserted Low. For more information, see "Signal 
Terminology" on page 139. 
[Figure 51. Logic Symbol Diagram. Signal groups shown: Bus Arbitration; Address and Address Parity; Cycle Definition and Control; Cache Control; Clock; Voltage Detection; Data and Data Parity; Inquire Cycles; Floating-Point Error Handling; External Interrupts, SMM, Reset, and Initialization; JTAG Test.] 
The following terminology is used in this chapter: 

■ Driven — The processor actively pulls the signal up to the 
High-voltage state or pulls the signal down to the 
Low-voltage state. 

■ Floated — The signal is not being driven by the processor 
(high-impedance state), which allows another device to 
drive this signal. 

■ Asserted — For all active-High signals, the term asserted 
means the signal is in the High-voltage state. For all 
active-Low signals, the term asserted means the signal is in 
the Low-voltage state. 

■ Negated — For all active-High signals, the term negated 
means the signal is in the Low-voltage state. For all 
active-Low signals, the term negated means the signal is in 
the High-voltage state. 

■ Sampled — The processor has measured the state of a signal 
at predefined points in time and will take the appropriate 
action based on the state of the signal. If a signal is not 
sampled by the processor, its assertion or negation has no 
effect on the operation of the processor. 
A20M# (Address Bit 20 Mask) 

Input 

Summary   A20M# is used to simulate the behavior of the 8086 when 
running in Real mode. The assertion of A20M# causes the 
processor to force bit 20 of the physical address to 0 prior to 
accessing the cache or driving out a memory bus cycle. The 
clearing of address bit 20 maps addresses that extend above the 
8086 1-Mbyte limit to below 1 Mbyte. 

Sampled   The processor samples A20M# as a level-sensitive input on 
every clock edge. The system logic can drive the signal either 
synchronously or asynchronously. If it is asserted 
asynchronously, it must be asserted for a minimum pulse width 
of two clocks. 

The following list explains the effects of the processor sampling 
A20M# asserted under various conditions: 

■ Inquire cycles and writeback cycles are not affected by the 
state of A20M#. 

■ The assertion of A20M# in System Management Mode 
(SMM) is ignored. 

■ When A20M# is sampled asserted in Protected mode, it 
causes unpredictable processor operation. A20M# is only 
defined in Real mode. 

■ To ensure that A20M# is recognized before the first ADS# 
occurs following the negation of RESET, A20M# must be 
sampled asserted on the same clock edge that RESET is 
sampled negated or on one of the two subsequent clock 
edges. 

■ To ensure A20M# is recognized before the execution of an 
instruction, a serializing instruction must be executed 
between the instruction that asserts A20M# and the 
targeted instruction. 
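
The address wraparound that A20M# reproduces can be illustrated with a few lines of Python. This is a simplified sketch: the function name is hypothetical, the signal is modeled as a boolean, and none of the sampling or timing conditions listed above are represented.

```python
A20_BIT = 1 << 20


def apply_a20m(physical_address, a20m_asserted):
    """Return the address after A20 masking: bit 20 is forced to 0 when asserted."""
    if a20m_asserted:
        return physical_address & ~A20_BIT
    return physical_address


# Real-mode segment FFFFh with offset FFFFh reaches just above 1 Mbyte.
linear = (0xFFFF << 4) + 0xFFFF          # 10FFEFh
wrapped = apply_a20m(linear, True)       # wraps to 0FFEFh, as on the 8086
unchanged = apply_a20m(linear, False)
```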
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Ap1:3l (Address Bus) 

A[31:5] Bidirectional A[4:3] Output 

Summary A[31:3] contain the physical address for the current bus cycle. 

The processor drives addresses on A[31:3] during memory and 
I/O cycles, and cycle definition information during special bus 
cycles. The processor samples addresses on A[31:5] during 
inquire cycles. 

Driven, Sampled, and 
Floated 

As Outputs: A[31:3] are driven valid off the same clock edge as 
ADS# and remain in the same state until the clock edge on 
which NA# or the last expected BRDY# of the cycle is sampled 
asserted. A[31:3] are driven during memory cycles, I/O cycles, 
special bus cycles, and interrupt acknowledge cycles. The 
processor continues to drive the address bus while the bus is 
idle. 

As Inputs: The processor samples A[31:5] during inquire cycles 
on the clock edge on which EADS# is sampled asserted. Even 
though A4 and A3 are not used during the inquire cycle, they 
must be driven to a valid state and must meet the same timings 
as A[31:5]. 

A[31:3] are floated off the clock edge that AHOLD or BOFF# is 
sampled asserted and off the clock edge that the processor 
asserts HLDA in recognition of HOLD. 

The processor resumes driving A[31:3] off the clock edge on 
which the processor samples AHOLD or BOFF# negated and off 
the clock edge on which the processor negates HLDA. 
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ADS# (Address Strobe) 



Output 

Summary The assertion of ADS# indicates the beginning of a new bus 

cycle. The address bus and all cycle definition signals 
corresponding to this bus cycle are driven valid off the same 
clock edge as ADS#. 

Driven and Floated ADS# is asserted for one clock at the beginning of each bus 
cycle. For non-pipelined cycles, ADS# can be asserted as early 
as the clock edge after the clock edge on which the last 
expected BRDY# of the cycle is sampled asserted, resulting in a 
single idle state between cycles. For pipelined cycles, if the 
processor is prepared to start a new cycle, ADS# can be asserted 
as early as one clock edge after NA# is sampled asserted. 

If AHOLD is sampled asserted, ADS# is driven only in order to 
perform a writeback cycle due to an inquire cycle that hits a 
modified cache line. 

The processor floats ADS# off the clock edge that BOFF# is 
sampled asserted and off the clock edge that the processor 
asserts HLDA in recognition of HOLD. 



ADSC# (Address Strobe Copy) 

Output 

Summary ADSC# has the same function and timing as ADS#. In the 
event ADS# becomes too heavily loaded due to a large fanout in 
a system, ADSC# can be used to split the load across two 
outputs, which can improve system timing. 
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AHOLD (Address Hold) 

Input 

Summary AHOLD can be asserted by the system to initiate one or more 

inquire cycles. To allow the system to drive the address bus 
during an inquire cycle, the processor floats A[31:3] and AP off 
the clock edge on which AHOLD is sampled asserted. The data 
bus and all other control and status signals remain under the 
control of the processor and are not floated. This allows a bus 
cycle that is in progress when AHOLD is sampled asserted to 
continue to completion. The processor resumes driving the 
address bus off the clock edge on which AHOLD is sampled 
negated. 

If AHOLD is sampled asserted, ADS# is asserted only in order 
to perform a writeback cycle due to an inquire cycle that hits a 
modified cache line. 

Sampled The processor samples AHOLD on every clock edge. AHOLD is 

recognized while INIT and RESET are sampled asserted. 
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AP (Address Parity) 



Bidirectional 

Summary AP contains the even parity bit for cache line addresses driven 
and sampled on A[31:5]. Even parity means that the total 
number of 1 bits on AP and A[31:5] is even. (A4 and A3 are not 
used for the generation or checking of address parity because 
these bits are not required to address a cache line.) AP is driven 
by the processor during processor-initiated cycles and is 
sampled by the processor during inquire cycles. If AP does not 
reflect even parity during an inquire cycle, the processor 
asserts APCHK# to indicate an address bus parity check. The 
processor does not take an internal exception as the result of 
detecting an address bus parity check, and system logic must 
respond appropriately to the assertion of this signal. 

Driven, Sampled, and 
Floated 

As an Output: The processor drives AP valid off the clock edge 
on which ADS# is asserted until the clock edge on which NA# or 
the last expected BRDY# of the cycle is sampled asserted. AP is 
driven during memory cycles, I/O cycles, special bus cycles, and 
interrupt acknowledge cycles. The processor continues to drive 
AP while the bus is idle. 

As an Input: The processor samples AP during inquire cycles on 
the clock edge on which EADS# is sampled asserted. 

The processor floats AP off the clock edge that AHOLD or 
BOFF# is sampled asserted and off the clock edge that the 
processor asserts HLDA in recognition of HOLD. 

The processor resumes driving AP off the clock edge on which 
the processor samples AHOLD or BOFF# negated and off the 
clock edge on which the processor negates HLDA. 
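
As a rough illustration of the even-parity rule for AP, the sketch below computes the parity bit over A[31:5] and models the APCHK# comparison made during an inquire cycle. The function names are hypothetical and the code models only the arithmetic, not signal timing.

```python
# Illustrative even-parity model for AP (names are assumptions).
def address_parity(addr: int) -> int:
    """Return the AP value that makes the 1-bit count of AP plus A[31:5] even."""
    line_bits = (addr & 0xFFFFFFFF) >> 5  # A4 and A3 (and below) are not covered
    return bin(line_bits).count("1") & 1


def apchk_asserted(inquire_addr: int, ap: int) -> bool:
    """True if the sampled AP does not reflect even parity for the inquire address."""
    return address_parity(inquire_addr) != ap


print(address_parity(0x0000_0020))     # only A5 is set, so AP must be 1
print(apchk_asserted(0x0000_0020, 0))  # wrong parity: APCHK# would be asserted
```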
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APCHK# (Address Parity Check) 



Output 

Summary If the processor detects an address parity error during an 

inquire cycle, APCHK# is asserted for one clock. The processor 
does not take an internal exception as the result of detecting an 
address bus parity check, and system logic must respond 
appropriately to the assertion of this signal. 

The processor ensures that APCHK# does not glitch, enabling 
the signal to be used as a clocking source for system logic. 

Driven APCHK# is driven valid off the clock edge after the clock edge 

on which the processor samples EADS# asserted. It is negated 
off the next clock edge. 

APCHK# is always driven except in the Tri-State Test mode. 






BE[7:0]# (Byte Enables) 



Output 

Summary BE[7:0]# are used by the processor to indicate the valid data 

bytes during a write cycle and the requested data bytes during 
a read cycle. The byte enables can be used to derive address bits 
A[2:0], which are not physically part of the processor's address 
bus. The processor checks and generates valid data parity for 
the data bytes that are valid as defined by the byte enables. The 
eight byte enables correspond to the eight bytes of the data bus 
as follows: 

■ BE7#: D[63:56] ■ BE3#: D[31:24] 

■ BE6#: D[55:48] ■ BE2#: D[23:16] 

■ BE5#: D[47:40] ■ BE1#: D[15:8] 

■ BE4#: D[39:32] ■ BE0#:D[7:0] 

The processor expects data to be driven by the system logic on 
all eight bytes of the data bus during a burst cache-line read 
cycle, independent of the byte enables that are asserted. 

The byte enables are also used to distinguish between special 
bus cycles as defined in Table 44 on page 186. 
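
The derivation of A[2:0] from the byte enables can be sketched as follows. This is an illustration only: it assumes the byte enables are packed into one 8-bit value with BE0# in bit 0 and a 0 bit meaning asserted (active low), and the function names are hypothetical.

```python
# Illustrative derivation of A[2:0] from BE[7:0]# (names are assumptions).
def low_address_bits(be_n: int) -> int:
    """Return A[2:0] as the index of the lowest asserted (low) byte enable."""
    asserted = ~be_n & 0xFF  # convert active-low pins to a 1 = asserted mask
    if asserted == 0:
        raise ValueError("no byte enable asserted")
    return (asserted & -asserted).bit_length() - 1


def data_lane(n: int) -> tuple:
    """Return the (high, low) data bus bit positions qualified by BEn#."""
    return (8 * n + 7, 8 * n)


print(low_address_bits(0x0F))  # BE7#..BE4# asserted: access starts at byte 4
print(data_lane(7))            # BE7# qualifies D[63:56]
```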

Driven and Floated BE[7:0]# are driven off the same clock edge as ADS# and 
remain in the same state until the clock edge on which NA# or 
the last expected BRDY# of the cycle is sampled asserted. 
BE[7:0]# are driven during memory cycles, I/O cycles, special 
bus cycles, and interrupt acknowledge cycles. 

The processor floats BE[7:0]# off the clock edge that BOFF# is 
sampled asserted and off the clock edge that the processor 
asserts HLDA in recognition of HOLD. Unlike the address bus, 
BE[7:0]# are not floated in response to AHOLD. 
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BF[2:0] (Bus Frequency) 

Inputs, Internal Pullups 

Summary BF[2:0] determine the internal operating frequency of the 

processor. The frequency of the CLK input signal is multiplied 
internally by a ratio determined by the state of these signals as 
defined in Table 37. BF[2:0] have weak internal pullups and 
default to the 3.5 multiplier if left unconnected. 



Table 37. Processor-to-Bus Clock Ratios 

State of BF[2:0] Inputs    Processor-Clock to Bus-Clock Ratio 
110b                       2.0x 
100b                       2.5x 
101b                       3.0x 
111b                       3.5x 
010b                       4.0x 
000b                       4.5x 
001b                       5.0x 
011b                       5.5x 


Sampled BF[2:0] are sampled during the falling transition of RESET. 

They must meet a minimum setup time of 1.0 ms and a 
minimum hold time of two clocks relative to the negation of 
RESET. 
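
The ratios in Table 37 lend themselves to a small lookup, sketched below. The dictionary and function names are hypothetical. The default argument reflects the statement that unconnected BF[2:0] inputs are pulled up to 111b, which selects the 3.5x ratio.

```python
# Illustrative lookup of Table 37 (names are assumptions).
BF_RATIOS = {
    0b110: 2.0, 0b100: 2.5, 0b101: 3.0, 0b111: 3.5,
    0b010: 4.0, 0b000: 4.5, 0b001: 5.0, 0b011: 5.5,
}


def core_frequency_mhz(bus_mhz: float, bf: int = 0b111) -> float:
    """Return the core clock in MHz for a given bus clock and BF[2:0] state."""
    return bus_mhz * BF_RATIOS[bf]


print(core_frequency_mhz(100.0))         # BF pins left unconnected
print(core_frequency_mhz(100.0, 0b010))  # 4.0x ratio
```

On a 100-MHz bus, the default 3.5x ratio gives a 350-MHz core clock and the 4.0x ratio gives 400 MHz.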
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BOFF# (Backoff) 



Input 

Summary If BOFF# is sampled asserted, the processor unconditionally 

aborts any cycles in progress and transitions to a bus hold state 
by floating the following signals: A[31:3], ADS#, ADSC#, AP, 
BE[7:0]#, CACHE#, D[63:0], D/C#, DP[7:0], LOCK#, M/IO#, 
PCD, PWT, SCYC, and W/R#. These signals remain floated until 
BOFF# is sampled negated. This allows an alternate bus master 
or the system to control the bus. 

When BOFF# is sampled negated, any processor cycle that was 
aborted due to the assertion of BOFF# is restarted from the 
beginning of the cycle, regardless of the number of transfers 
that were completed. If BOFF# is sampled asserted on the same 
clock edge as BRDY# of a bus cycle of any length, then BOFF# 
takes precedence over the BRDY#. In this case, the cycle is 
aborted and restarted after BOFF# is sampled negated. 

Sampled BOFF# is sampled on every clock edge. The processor floats its 

bus signals off the clock edge on which BOFF# is sampled 
asserted. These signals remain floated until the clock edge on 
which BOFF# is sampled negated. 

BOFF# is recognized while INIT and RESET are sampled 
asserted. 
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BRDY# (Burst Ready) 



Input, Internal Pullup 

Summary BRDY# is asserted to the processor by system logic to indicate 

either that the data bus is being driven with valid data during a 
read cycle or that the data bus has been latched during a write 
cycle. If necessary, the system logic can insert bus cycle wait 
states by negating BRDY# until it is ready to continue the data 
transfer. BRDY# is also used to indicate the completion of 
special bus cycles. 

Sampled BRDY# is sampled every clock edge within a bus cycle starting 
with the clock edge after the clock edge that negates ADS#. 
BRDY# is ignored while the bus is idle. The processor samples 
the following inputs on the clock edge on which BRDY# is 
sampled asserted: D[63:0], DP[7:0], and KEN# during read 
cycles, EWBE# during write cycles, and WB/WT# during read 
and write cycles. If NA# is sampled asserted prior to BRDY#, 
then KEN# and WB/WT# are sampled on the clock edge on 
which NA# is sampled asserted. 

The number of times the processor expects to sample BRDY# 
asserted depends on the type of bus cycle, as follows: 

■ One time for a single-transfer cycle, a special bus cycle, or 
each of two cycles in an interrupt acknowledge sequence 

■ Four times for a burst cycle (once for each data transfer) 

BRDY# can be held asserted for four consecutive clocks 
throughout the four transfers of the burst, or it can be negated 
to insert wait states. 
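
The BRDY# counts above, together with the effect of wait states, can be modeled with a short sketch. All names are hypothetical. The model treats each list element as one sampled clock edge within the cycle and does not represent the start of sampling relative to ADS#.

```python
# Illustrative model of expected BRDY# assertions (names are assumptions).
EXPECTED_BRDY = {
    "single": 1,          # single-transfer cycle
    "special": 1,         # special bus cycle
    "interrupt_ack": 1,   # each of the two interrupt acknowledge cycles
    "burst": 4,           # one per data transfer
}


def clocks_to_complete(cycle_type, brdy_samples):
    """Return the clock on which the last expected BRDY# is sampled asserted.

    brdy_samples holds one bool per sampled clock edge (True = asserted).
    Returns None if the samples end before the cycle completes.
    """
    needed = EXPECTED_BRDY[cycle_type]
    for clock, asserted in enumerate(brdy_samples, start=1):
        if asserted:
            needed -= 1
            if needed == 0:
                return clock
    return None


print(clocks_to_complete("burst", [True, True, True, True]))
print(clocks_to_complete("burst", [False, True, True, False, True, True]))
```

The first burst completes in four sampled clocks. The second takes six because BRDY# is negated twice to insert wait states.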







Signal Descriptions 




BRDYC# (Burst Ready Copy) 



Input, Internal Pullup 

Summary BRDYC# has the same function as BRDY#. In the event 
BRDY# becomes too heavily loaded due to a large fanout or 
loading in a system, BRDYC# can be used to reduce this 
loading, which improves timing. 

In addition, BRDYC# is sampled when RESET is negated to 
configure the drive strength of A[20:3], ADS#, HITM#, and 
W/R#. If BRDYC# is 0 during the falling transition of RESET, 
these particular outputs are configured using higher drive 
strengths than the standard strength. If BRDYC# is 1 during the 
falling transition of RESET, the standard strength is selected. 

Sampled BRDYC# is sampled every clock edge within a bus cycle starting 
with the clock edge after the clock edge that negates ADS#. 

BRDYC# is also sampled during the falling transition of RESET. 
If RESET is driven synchronously, BRDYC# must meet the 
specified hold time relative to the negation of RESET. If 
RESET is driven asynchronously, the minimum setup and hold 
time for BRDYC# relative to the negation of RESET is two 
clocks. 
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BREQ (Bus Request) 

Output 

Summary BREQ is asserted by the processor to request the bus in order to 

complete an internally pending bus cycle. The system logic can 
use BREQ to arbitrate among the bus participants. If the 
processor does not own the bus, BREQ is asserted until the 
processor gains access to the bus in order to begin the pending 
cycle or until the processor no longer needs to run the pending 
cycle. If the processor currently owns the bus, BREQ is asserted 
with ADS#. The processor asserts BREQ for each assertion of 
ADS# but does not necessarily assert ADS# for each assertion of 
BREQ. 

Driven BREQ is asserted off the same clock edge on which ADS# is 
asserted. BREQ can also be asserted off any clock edge, 
independent of the assertion of ADS#. BREQ can be negated 
one clock edge after it is asserted. 

The processor always drives BREQ except in the Tri-State Test 
mode. 
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



CACHE# (Cacheable Access) 



Output 

Summary For reads, CACHE# is asserted to indicate the cacheability of 
the current bus cycle. In addition, if the processor samples 
KEN# asserted, which indicates the driven address is 
cacheable, the cycle is a 32-byte burst read cycle. For write 
cycles, CACHE# is asserted to indicate the current bus cycle is a 
modified cache-line writeback. KEN# is ignored during 
writebacks. If CACHE# is not asserted, or if KEN# is sampled 
negated during a read cycle, the cycle is not cacheable and 
defaults to a single-transfer cycle. 

Driven and Floated CACHE# is driven off the same clock edge as ADS# and remains 
in the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. 

CACHE# is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in recognition of HOLD. 
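
The way CACHE# and KEN# together select the type of bus cycle can be summarized in a small decision function. This is a sketch under stated assumptions: the function name and the returned strings are hypothetical, and the booleans stand for "sampled or driven asserted" rather than electrical levels.

```python
# Illustrative decision logic for CACHE# and KEN# (names are assumptions).
def cycle_type(is_write: bool, cache_asserted: bool, ken_asserted: bool) -> str:
    """Classify a memory cycle from CACHE# and KEN# as described for CACHE#."""
    if is_write:
        # KEN# is ignored during writebacks.
        return "modified cache-line writeback" if cache_asserted else "single-transfer write"
    if cache_asserted and ken_asserted:
        return "32-byte burst read"
    return "single-transfer read"  # not cacheable


print(cycle_type(False, True, True))
print(cycle_type(False, True, False))
print(cycle_type(True, True, False))
```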



CLK (Clock) 



Input 



Summary The CLK signal is the bus clock for the processor and is the 

reference for all signal timings under normal operation (except 
for TDI, TDO, TMS, and TRST#). BF[2:0] determine the internal 
frequency multiplier applied to CLK to obtain the processor's 
core operating frequency. See "BF[2:0] (Bus Frequency)" on 
page 147 for a list of the processor-to-bus clock ratios. 

Sampled The CLK signal must be stable a minimum of 1.0 ms prior to the 

negation of RESET to ensure the proper operation of the 
processor. See "CLK Switching Characteristics" on page 314 for 
details regarding the CLK specifications. 
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




D/C# (Data/Code) 



Output 

Summary The processor drives D/C# during a memory bus cycle to 

indicate whether it is addressing data or executable code. D/C# 
is also used to define other bus cycles, including interrupt 
acknowledge and special cycles. See Table 44 on page 186 for 

more details. 

Driven and Floated D/C# is driven off the same clock edge as ADS# and remains in 
the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. D/C# is 
driven during memory cycles, I/O cycles, special bus cycles, and 
interrupt acknowledge cycles. 

D/C# is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in recognition of HOLD. 
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



D[63:0] (Data Bus) 



Bidirectional 

Summary D[63:0] represent the processor's 64-bit data bus. Each of the 

eight bytes of data that comprise this bus is qualified as valid 
by its corresponding byte enable. See "BE[7:0]# (Byte 
Enables)" on page 146. 

Driven, Sampled, and 
Floated 

As Outputs: For single-transfer write cycles, the processor drives 
D[63:0] with valid data one clock edge after the clock edge on 
which ADS# is asserted, and D[63:0] remain in the same state 
until the clock edge on which BRDY# is sampled asserted. If the 
cycle is a writeback (in which case four 8-byte transfers 
occur), D[63:0] are driven one clock edge after the clock edge 
on which ADS# is asserted and are subsequently changed off 
the clock edge on which each BRDY# assertion of the burst 
cycle is sampled. 

If the assertion of ADS# represents a pipelined write cycle that 
follows a read cycle, the processor does not drive D[63:0] until it 
is certain that contention on the data bus will not occur. In this 
case, D[63:0] are driven the clock edge after the last expected 
BRDY# of the previous cycle is sampled asserted. 

As Inputs: During read cycles, the processor samples D[63:0] on 
the clock edge on which BRDY# is sampled asserted. 

The processor always floats D[63:0] except when they are being 
driven during a write cycle as described above. In addition, 
D[63:0] are floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts 
HLDA in recognition of HOLD. 
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




DP[7:0] (Data Parity) 



Bidirectional 

Summary DP[7:0] are even parity bits for each valid byte of data — as 

defined by BE[7:0]# — driven and sampled on the D[63:0] data 
bus. Even parity means that the total number of 1 bits within 
each byte of data and its respective data parity bit is an even 
number. DP[7:0] are driven by the processor during write cycles 
and sampled by the processor during read cycles. If the 
processor detects bad parity on any valid byte of data during a 
read cycle, PCHK# is asserted for one clock beginning the clock 
edge after BRDY# is sampled asserted. The processor does not 
take an internal exception as the result of detecting a data 
parity check, and system logic must respond appropriately to 
the assertion of this signal. 

The eight data parity bits correspond to the eight bytes of the 
data bus as follows: 

■ DP7: D[63:56] ■ DP3: D[31:24] 

■ DP6: D[55:48] ■ DP2: D[23:16] 

■ DP5: D[47:40] ■ DP1: D[15:8] 

■ DP4: D[39:32] ■ DP0: D[7:0] 

For systems that do not support data parity, DP[7:0] should be 
connected to VCC3 through pullup resistors. 

Driven, Sampled, and 
Floated 

As Outputs: For single-transfer write cycles, the processor drives 
DP[7:0] with valid parity one clock edge after the clock edge on 
which ADS# is asserted, and DP[7:0] remain in the same state 
until the clock edge on which BRDY# is sampled asserted. If the 
cycle is a writeback, DP[7:0] are driven one clock edge after the 
clock edge on which ADS# is asserted and are subsequently 
changed off the clock edge on which each BRDY# assertion of 
the burst cycle is sampled. 

As Inputs: During read cycles, the processor samples DP[7:0] on 
the clock edge on which BRDY# is sampled asserted. 



The processor always floats DP[7:0] except when they are being 
driven during a write cycle as described above. In addition, 
DP[7:0] are floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts 
HLDA in recognition of HOLD. 
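
The per-byte even parity carried on DP[7:0] can be sketched as follows. The function name is hypothetical, and the byte enables are assumed to be packed into one 8-bit value with BE0# in bit 0 and a 0 bit meaning asserted. Only bytes qualified by an asserted byte enable receive a parity bit in this model.

```python
# Illustrative per-byte even parity for DP[7:0] (names are assumptions).
def data_parity(data: int, be_n: int) -> dict:
    """Return {byte index: DPn value} for each byte whose BEn# is asserted (low)."""
    parity = {}
    for n in range(8):
        if not (be_n >> n) & 1:  # BEn# is active low
            byte = (data >> (8 * n)) & 0xFF
            # DPn makes the count of 1 bits in the byte plus DPn even.
            parity[n] = bin(byte).count("1") & 1
    return parity


print(data_parity(0xFF00_0000_0000_0001, 0x7E))  # bytes 0 and 7 are valid
```

Byte 0 holds a single 1 bit, so DP0 is 1. Byte 7 holds eight 1 bits, so DP7 is 0.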
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



EADS# (External Address Strobe) 



Input 

Summary System logic asserts EADS# during a cache inquire cycle to 
indicate that the address bus contains a valid address. EADS# 
can only be driven after the system logic has taken control of 
the address bus by asserting AHOLD or BOFF# or by receiving 
HLDA. The processor responds to the sampling of EADS# and 
the address bus by driving HIT#, which indicates if the inquired 
cache line exists in the processor's cache, and HITM#, which 
indicates if it is in the modified state. 

Sampled If AHOLD or BOFF# is asserted by the system logic in order to 
execute a cache inquire cycle, the processor begins sampling 
EADS# two clock edges after AHOLD or BOFF# is sampled 
asserted. If the system logic asserts HOLD in order to execute a 
cache inquire cycle, the processor begins sampling EADS# two 
clock edges after the clock edge on which HLDA is asserted by 
the processor. 

EADS# is ignored during the following conditions: 

■ One clock edge after the clock edge on which EADS# is 
sampled asserted 

■ Two clock edges after the clock edge on which ADS# is 
asserted 

■ When the processor is driving the address bus 

■ When the processor asserts HITM# 
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




EWBE# (External Write Buffer Empty) 



Input 



Summary The system logic can negate EWBE# to the processor to indicate 
that its external write buffers are full and that additional data 
cannot be stored at this time. This causes the processor to delay 
the following activities until EWBE# is sampled asserted: 

■ The commitment of write hit cycles to cache lines in the 
modified state or exclusive state in the processor's cache 

■ The decode and execution of an instruction that follows a 
currently-executing serializing instruction 

■ The assertion or negation of SMIACT# 

■ The entering of the Halt state and the Stop Grant state 

Negating EWBE# does not prevent the completion of any type 
of cycle that is currently in progress. 

Sampled The processor samples EWBE# on each clock edge that BRDY# 
is sampled asserted during all memory write cycles (except 
writeback cycles), I/O write cycles, and special bus cycles. 

If EWBE# is sampled negated, it is sampled on every clock edge 
until it is asserted, and then it is ignored until BRDY# is 
sampled asserted in the next write cycle or special cycle. 
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



FERR# (Floating-Point Error) 



Output 

Summary The assertion of FERR# indicates the occurrence of an 

unmasked floating-point exception resulting from the 
execution of a floating-point instruction. This signal is provided 
to allow the system logic to handle this exception in a manner 
consistent with IBM-compatible PC/AT systems. See "Handling 
Floating-Point Exceptions" on page 254 for a system logic 
implementation that supports floating-point exceptions. 

The state of the numeric error (NE) bit in CR0 does not affect 
the FERR# signal. 

The processor ensures that FERR# does not glitch, enabling the 
signal to be used as a clocking source for system logic. 

Driven The processor asserts FERR# on the instruction boundary of 

the next floating-point instruction, MMX instruction, or WAIT 
instruction that occurs following the floating-point instruction 
that caused the unmasked floating-point exception — that is, 
FERR# is not asserted at the time the exception occurs. The 
IGNNE# signal does not affect the assertion of FERR#. 

FERR# is negated during the following conditions: 

■ Following the successful execution of the floating-point 
instructions FCLEX, FINIT, FSAVE, and FSTENV 

■ Under certain circumstances, following the successful 
execution of the floating-point instructions FLDCW, 
FLDENV, and FRSTOR, which load the floating-point status 
word or the floating-point control word 

■ Following the falling transition of RESET 

FERR# is always driven except in the Tri-State Test mode. 

See "IGNNE# (Ignore Numeric Exception)" on page 163 for 
more details on floating-point exceptions. 
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
FLUSH# (Cache Flush) 

Input 

Summary In response to sampling FLUSH# asserted, the processor writes 

back any data cache lines that are in the modified state, 
invalidates all lines in the instruction and data caches, and then 
executes a flush acknowledge special cycle. See Table 44 on 
page 186 for the bus definition of special cycles. 

In addition, FLUSH# is sampled when RESET is negated to 
determine if the processor enters the Tri-State Test mode. If 
FLUSH# is 0 during the falling transition of RESET, the 
processor enters the Tri-State Test mode instead of performing 
the normal RESET functions. 

Sampled FLUSH# is sampled and latched as a falling edge-sensitive 

signal. During normal operation (not RESET), FLUSH# is 
sampled on every clock edge but is not recognized until the next 
instruction boundary. If FLUSH# is asserted synchronously, it 
can be asserted for a minimum of one clock. If FLUSH# is 
asserted asynchronously, it must have been negated for a 
minimum of two clocks, followed by an assertion of a minimum 
of two clocks. 

FLUSH# is also sampled during the falling transition of RESET. 
If RESET and FLUSH# are driven synchronously, FLUSH# is 
sampled on the clock edge prior to the clock edge on which 
RESET is sampled negated. If RESET is driven asynchronously, 
the minimum setup and hold time for FLUSH#, relative to the 
negation of RESET, is two clocks. 
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



HIT# (Inquire Cycle Hit) 



Output 

Summary The processor asserts HIT# during an inquire cycle to indicate 
that the cache line is valid within the processor's instruction or 
data cache (also known as a cache hit). The cache line can be in 
the modified, exclusive, or shared state. 

Driven HIT# is always driven, except in the Tri-State Test mode, and 
only changes state the clock edge after the clock edge on which 
EADS# is sampled asserted. It is driven in the same state until 
the next inquire cycle. 



HITM# (Inquire Cycle Hit To Modified Line) 



Output 

Summary The processor asserts HITM# during an inquire cycle to 

indicate that the cache line exists in the processor's data cache 
in the modified state. The processor performs a writeback cycle 
as a result of this cache hit. If an inquire cycle hits a cache line 
that is currently being written back, the processor asserts 
HITM# but does not execute another writeback cycle. The 
system logic must not expect the processor to assert ADS# each 
time HITM# is asserted. 

Driven HITM# is always driven — except in the Tri-State Test mode — 

and, in particular, is driven to represent the result of an inquire 
cycle the clock edge after the clock edge on which EADS# is 
sampled asserted. If HITM# is negated in response to the 
inquire address, it remains negated until the next inquire cycle. 
If HITM# is asserted in response to the inquire address, it 
remains asserted throughout the writeback cycle and is negated 
one clock edge after the last BRDY# of the writeback is 
sampled asserted. 
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HLDA (Hold Acknowledge) 



Output 

Summary When HOLD is sampled asserted, the processor completes the 

current bus cycles, floats the processor bus, and asserts HLDA 
in an acknowledgment that these events have been completed. 
The processor does not assert HLDA until the completion of a 
locked sequence of cycles. While HLDA is asserted, another bus 
master can drive cycles on the bus, including inquire cycles to 
the processor. The following signals are floated when HLDA is 
asserted: A[31:3], ADS#, ADSC#, AP, BE[7:0]#, CACHE#, 
D[63:0], D/C#, DP[7:0], LOCK#, M/IO#, PCD, PWT, SCYC, and 
W/R#. 

The processor ensures that HLDA does not glitch. 

Driven HLDA is always driven except in the Tri-State Test mode. If a 

processor cycle is in progress while HOLD is sampled asserted, 
HLDA is asserted one clock edge after the last BRDY# of the 
cycle is sampled asserted. If the bus is idle, HLDA is asserted 
one clock edge after HOLD is sampled asserted. HLDA is 
negated one clock edge after the clock edge on which HOLD is 
sampled negated. 

The assertion of HLDA is independent of the sampled state of 
BOFF#. 

The processor floats the bus every clock in which HLDA is 
asserted. 
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
HOLD (Bus Hold Request) 

Input 

Summary The system logic can assert HOLD to gain control of the 

processor's bus. When HOLD is sampled asserted, the processor 
completes the current bus cycles, floats the processor bus, and 
asserts HLDA in an acknowledgment that these events have 
been completed. 

Sampled The processor samples HOLD on every clock edge. If a 

processor cycle is in progress while HOLD is sampled asserted, 
HLDA is asserted one clock edge after the last BRDY# of the 
cycle is sampled asserted. If the bus is idle, HLDA is asserted 
one clock edge after HOLD is sampled asserted. HOLD is 
recognized while INIT and RESET are sampled asserted. 
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IGNNE# (Ignore Numeric Exception) 

Input 

Summary IGNNE#, in conjunction with the numeric error (NE) bit in CR0, 

is used by the system logic to control the effect of an unmasked 
floating-point exception on a previous floating-point instruction 
during the execution of a floating-point instruction, MMX 
instruction, or the WAIT instruction — hereafter referred to as 
the target instruction. 

If an unmasked floating-point exception is pending and the 
target instruction is considered error-sensitive, then the 
relationship between NE and IGNNE# is as follows: 

■ If NE = 0, then: 

• If IGNNE# is sampled asserted, the processor ignores the 
floating-point exception and continues with the 
execution of the target instruction. 

• If IGNNE# is sampled negated, the processor waits until 
it samples IGNNE#, INTR, SMI#, NMI, or INIT asserted. 

If IGNNE# is sampled asserted while waiting, the 
processor ignores the floating-point exception and 
continues with the execution of the target instruction. 

If INTR, SMI#, NMI, or INIT is sampled asserted while 
waiting, the processor handles its assertion 
appropriately. 

■ If NE = 1, the processor invokes the INT 10h exception 
handler. 

If an unmasked floating-point exception is pending and the 
target instruction is considered error-insensitive, then the 
processor ignores the floating-point exception and continues 
with the execution of the target instruction. 

FERR# is not affected by the state of the NE bit or IGNNE#. 
FERR# is always asserted at the instruction boundary of the 
target instruction that follows the floating-point instruction 
that caused the unmasked floating-point exception. 

This signal is provided to allow the system logic to handle 
exceptions in a manner consistent with IBM-compatible PC/AT 
systems. 
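
The interaction of NE, IGNNE#, and the error sensitivity of the target instruction can be condensed into a small decision function. This is a sketch only: the function name and the returned strings are hypothetical, and the waiting case simply reports that the processor stalls until IGNNE#, INTR, SMI#, NMI, or INIT is sampled asserted.

```python
# Illustrative decision logic for a pending unmasked floating-point
# exception (names are assumptions).
def pending_fp_exception_action(ne: int, ignne_asserted: bool,
                                error_sensitive: bool) -> str:
    """Return what the processor does at the target instruction."""
    if not error_sensitive:
        return "ignore exception and execute target instruction"
    if ne == 1:
        return "invoke INT 10h exception handler"
    if ignne_asserted:
        return "ignore exception and execute target instruction"
    return "wait for IGNNE#, INTR, SMI#, NMI, or INIT"


print(pending_fp_exception_action(0, False, True))
print(pending_fp_exception_action(1, True, True))
```

In both calls FERR# would still be asserted at the boundary of the target instruction, because FERR# does not depend on NE or IGNNE#.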
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



Sampled The processor samples IGNNE# as a level-sensitive input on 
every clock edge. The system logic can drive the signal either 
synchronously or asynchronously. If it is asserted 
asynchronously, it must be asserted for a minimum pulse width 
of two clocks. 



INIT (Initialization) 



Input 

Summary The assertion of INIT causes the processor to empty its 

pipelines, to initialize most of its internal state, and to branch 
to address FFFF_FFF0h, the same instruction execution 
starting point used after RESET. Unlike RESET, the processor 
preserves the contents of its caches, the floating-point state, the 
MMX state, model-specific registers, the CD and NW bits of the 
CR0 register, and other specific internal resources. 

INIT can be used as an accelerator for 80286 code that requires 
a reset to exit from Protected mode back to Real mode. 

Sampled INIT is sampled and latched as a rising edge-sensitive signal. 

INIT is sampled on every clock edge but is not recognized until 
the next instruction boundary. During an I/O write cycle, it must 
be sampled asserted a minimum of three clock edges before 
BRDY# is sampled asserted if it is to be recognized on the 
boundary between the I/O write instruction and the following 
instruction. 

If INIT is asserted synchronously, it can be asserted for a 
minimum of one clock. If it is asserted asynchronously, it must 
have been negated for a minimum of two clocks, followed by an 
assertion of a minimum of two clocks. 
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




INTR (Maskable Interrupt) 



Input 



Summary INTR is the system's maskable interrupt input to the processor.
When the processor samples and recognizes INTR asserted, the
processor executes a pair of interrupt acknowledge bus cycles
and then jumps to the interrupt service routine specified by the
interrupt number that was returned during the interrupt
acknowledge sequence. The processor only recognizes INTR if
the interrupt flag (IF) in the EFLAGS register equals 1.

Sampled The processor samples INTR as a level-sensitive input on every
clock edge, but the interrupt request is not recognized until the 
next instruction boundary. The system logic can drive INTR 
either synchronously or asynchronously. If it is asserted 
asynchronously, it must be asserted for a minimum pulse width 
of two clocks. In order to be recognized, INTR must remain 
asserted until an interrupt acknowledge sequence is complete. 



INV (Invalidation Request)

Input

Summary During an inquire cycle, the state of INV determines whether
an addressed cache line that is found in the processor's
instruction or data cache transitions to the invalid state or the
shared state.

If INV is sampled asserted during an inquire cycle, the
processor transitions the cache line (if found) to the invalid
state, regardless of its previous state. If INV is sampled negated
during an inquire cycle, the processor transitions the cache line
(if found) to the shared state. In either case, if the cache line is
found in the modified state, the processor writes it back to
memory before changing its state.

Sampled INV is sampled on the clock edge on which EADS# is sampled
asserted.
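
The effect of INV on an inquire-cycle hit can be sketched as a small Python function. This is only an illustration of the rule stated in the INV description; the function name inquire_hit and the string state names are hypothetical and are not part of the processor documentation.

```python
def inquire_hit(state, inv_asserted):
    """Model an inquire cycle that hits a cache line.

    state: "modified", "exclusive", or "shared" (hypothetical labels).
    Returns (new_state, writeback_needed).
    """
    if state not in ("modified", "exclusive", "shared"):
        raise ValueError("line must be present in the cache")
    # A modified line is written back before its state changes.
    writeback_needed = state == "modified"
    # INV asserted invalidates the line; INV negated leaves it shared.
    new_state = "invalid" if inv_asserted else "shared"
    return new_state, writeback_needed
```

For example, inquire_hit("modified", False) reports a writeback followed by a transition to the shared state.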
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KEN# (Cache Enable) 



Input

Summary If KEN# is sampled asserted, it indicates that the address
presented by the processor is cacheable. If KEN# is sampled
asserted and the processor intends to perform a cache-line fill
(signified by the assertion of CACHE#), the processor executes
a 32-byte burst read cycle and expects to sample BRDY#
asserted a total of four times. If KEN# is sampled negated
during a read cycle, a single-transfer cycle is executed and the
processor does not cache the data. For write cycles, CACHE# is
asserted to indicate the current bus cycle is a modified
cache-line writeback. KEN# is ignored during writebacks.

If PCD is asserted during a bus cycle, the processor does not 
cache any data read during that cycle, regardless of the state of 
KEN#. See "PCD (Page Cache Disable)" on page 171 for more 
details. 

If the processor has sampled the state of KEN# during a cycle, 
and that cycle is aborted due to the sampling of BOFF# 
asserted, the system logic must ensure that KEN# is sampled in
the same state when the processor restarts the aborted cycle. 

Sampled KEN# is sampled on the clock edge on which the first BRDY# or
NA# of a read cycle is sampled asserted. If the read cycle is a
burst, KEN# is ignored during the last three assertions of
BRDY#. KEN# is sampled during read cycles only when
CACHE# is asserted.




LOCK# (Bus Lock) 



Output 

Summary The processor asserts LOCK# during a sequence of bus cycles to 

ensure that the cycles are completed without allowing other bus 
masters to intervene. Locked operations consist of two to five 
bus cycles. LOCK# is asserted during the following operations: 

■ An interrupt acknowledge sequence 

■ Descriptor Table accesses 

■ Page Directory and Page Table accesses 

■ XCHG instruction 

■ An instruction with an allowable LOCK prefix 

In order to ensure that locked operations appear on the bus and 
are visible to the entire system, any data operands addressed 
during a locked cycle that reside in the processor's cache are 
flushed and invalidated from the cache prior to the locked 
operation. If the cache line is in the modified state, it is written 
back and invalidated prior to the locked operation. Likewise, 
any data read during a locked operation is not cached. 

The processor ensures that LOCK# does not glitch. 

Driven and Floated During a locked cycle, LOCK# is asserted off the same clock
edge on which ADS# is asserted and remains asserted until the 
last BRDY# of the last bus cycle is sampled asserted. The 
processor negates LOCK# for at least one clock between 
consecutive sequences of locked operations to allow the system 
logic to arbitrate for the bus. 

LOCK# is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD. When LOCK# is floated due to BOFF# 
sampled asserted, the system logic is responsible for preserving 
the lock condition while LOCK# is in the high-impedance state. 
M/IO# (Memory or I/O)

Output

Summary The processor drives M/IO# during a bus cycle to indicate 

whether it is addressing the memory or I/O space. If M/IO# = 1, 
the processor is addressing memory or a memory-mapped I/O 
port as the result of an instruction fetch or an instruction that 
loads or stores data. If M/IO# = 0, the processor is addressing an 
I/O port during the execution of an I/O instruction. In addition, 
M/IO# is used to define other bus cycles, including interrupt 
acknowledge and special cycles. See Table 44 on page 186 for 
more details. 

Driven and Floated M/IO# is driven off the same clock edge as ADS# and remains in

the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. M/IO# is 
driven during memory cycles, I/O cycles, special bus cycles, and 
interrupt acknowledge cycles. 

M/IO# is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD. 




NA# (Next Address) 



Input 

Summary System logic asserts NA# to indicate to the processor that it is

ready to accept another bus cycle pipelined into the previous 
bus cycle. ADS#, along with address and status signals, can be 
asserted as early as one clock edge after NA# is sampled 
asserted if the processor is prepared to start a new cycle. 
Because the processor allows a maximum of two cycles to be in 
progress at a time, the assertion of NA# is sampled while two 
cycles are in progress but ADS# is not asserted until the 
completion of the first cycle. 

Sampled NA# is sampled every clock edge during bus cycles, starting one 

clock edge after the clock edge that negates ADS#, until the last 
expected BRDY# of the last executed cycle is sampled asserted 
(with the exception of the clock edge after the clock edge that 
negates the ADS# for a second pending cycle). Because the 
processor latches NA# when sampled, the system logic only 
needs to assert NA# for one clock. 
NMI (Non-Maskable Interrupt)

Input 

Summary When NMI is sampled asserted, the processor jumps to the 

interrupt service routine defined by interrupt number 02h. 
Unlike the INTR signal, software cannot mask the effect of NMI 
if it is sampled asserted by the processor. However, NMI is 
temporarily masked upon entering System Management Mode 
(SMM). In addition, an interrupt acknowledge cycle is not 
executed because the interrupt number is predefined. 

If NMI is sampled asserted while the processor is executing the
interrupt service routine for a previous NMI, the subsequent 
NMI remains pending until the completion of the execution of 
the IRET instruction at the end of the interrupt service routine. 

Sampled NMI is sampled and latched as a rising edge-sensitive signal. 

During normal operation, NMI is sampled on every clock edge 
but is not recognized until the next instruction boundary. If it is 
asserted synchronously, it can be asserted for a minimum of one 
clock. If it is asserted asynchronously, it must have been 
negated for a minimum of two clocks, followed by an assertion 
of a minimum of two clocks. 
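
The asynchronous timing rule for edge-sensitive inputs such as NMI can be illustrated with a short Python sketch. It assumes a simplified model in which the signal is represented as a list of per-clock samples (True for asserted); the function name async_edge_pulse_ok is hypothetical, and setup and hold times are not modeled.

```python
def async_edge_pulse_ok(samples):
    """Check for at least two negated clocks followed by at least
    two asserted clocks somewhere in the sample list.

    samples: list of booleans, one per clock, True = asserted.
    """
    for i in range(2, len(samples) - 1):
        negated_before = samples[i - 2:i] == [False, False]
        asserted_after = samples[i:i + 2] == [True, True]
        if negated_before and asserted_after:
            return True
    return False
```

A pulse that is asserted for only one clock, or that follows a single negated clock, is rejected by this check.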
PCD (Page Cache Disable) 

Output 

Summary The processor drives PCD to indicate the operating system's 

specification of cacheability for the page being addressed. 
System logic can use PCD to control external caching. If PCD is
asserted, the addressed page is not cached. If PCD is negated,
the cacheability of the addressed page depends upon the state
of CACHE# and KEN#.

The state of PCD depends upon the processor's operating mode 
and the state of certain bits in its control registers and TLB as 
follows: 

■ In Real mode, or in Protected and Virtual-8086 modes while
paging is disabled (PG bit in CR0 set to 0):

PCD output = CD bit in CR0

■ In Protected and Virtual-8086 modes while caching is
enabled (CD bit in CR0 set to 0) and paging is enabled (PG
bit in CR0 set to 1):

• For accesses to I/O space, page directory entries, and
other non-paged accesses:

PCD output = PCD bit in CR3 

• For accesses to 4-Kbyte page table entries or 4-Mbyte 
pages: 

PCD output = PCD bit in page directory entry 

• For accesses to 4-Kbyte pages: 

PCD output = PCD bit in page table entry 
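
The selection rules listed above can be summarized in a minimal Python sketch. The function name pcd_output, its parameters, and the access-type labels are hypothetical and chosen only for illustration. The combination of paging enabled with CD = 1 is not covered by the list above, so the sketch rejects it rather than guessing.

```python
def pcd_output(paging_enabled, cd, access="nonpaged",
               cr3_pcd=0, pde_pcd=0, pte_pcd=0):
    """Return the level driven on PCD for one access.

    paging_enabled: PG bit in CR0 (False also covers Real mode).
    cd: CD bit in CR0.
    access: "nonpaged", "pte_or_4mb_page", or "4kb_page".
    """
    if not paging_enabled:
        return cd                      # PCD output = CD bit in CR0
    if cd != 0:
        raise ValueError("case not described in the text")
    if access == "nonpaged":           # I/O space, page directory entries
        return cr3_pcd
    if access == "pte_or_4mb_page":    # page table entries, 4-Mbyte pages
        return pde_pcd
    if access == "4kb_page":
        return pte_pcd
    raise ValueError("unknown access type")
```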

Driven and Floated PCD is driven off the same clock edge as ADS# and remains in 
the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. 

PCD is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD.



PCHK# (Parity Check) 



Output 

Summary The processor asserts PCHK# during read cycles if it detects an 

even parity error on one or more valid bytes of D[63:0] during a 
read cycle. (Even parity means that the total number of 1 bits 
within each byte of data and its respective data parity bit is 
even.) The processor checks data parity for the data bytes that 
are valid, as defined by BE[7:0]#, the byte enables.
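
The even-parity check described above can be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, the data bus is passed as a 64-bit integer, the parity bits as an 8-bit integer, and the byte enables as an 8-bit integer in which a 0 bit marks a valid byte (active-Low).

```python
def byte_parity_ok(data_byte, parity_bit):
    """Even parity: the 1 bits in the byte plus its parity bit sum to an even count."""
    ones = bin(data_byte & 0xFF).count("1") + (parity_bit & 1)
    return ones % 2 == 0

def parity_error(data, dp, be_n):
    """Return True if any enabled byte of D[63:0] fails even parity.

    data: 64-bit value on D[63:0]
    dp:   8-bit value on DP[7:0], one parity bit per byte
    be_n: 8-bit value on BE[7:0]#, 0 = byte is valid
    """
    for i in range(8):
        if (be_n >> i) & 1 == 0:                 # byte i is enabled
            byte = (data >> (8 * i)) & 0xFF
            if not byte_parity_ok(byte, (dp >> i) & 1):
                return True
    return False
```

Bytes that are not enabled are skipped, so a parity mismatch on a disabled byte does not report an error in this model.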

PCHK# is always driven but is only asserted for memory and I/O 
read bus cycles and the second cycle of an interrupt 
acknowledge sequence. PCHK# is not driven during any type of 
write cycles or special bus cycles. The processor does not take 
an internal exception as the result of detecting a data parity 
error, and system logic must respond appropriately to the 
assertion of this signal. 

The processor ensures that PCHK# does not glitch, enabling the 
signal to be used as a clocking source for system logic.

Driven PCHK# is always driven except in the Tri-State Test mode. For

each BRDY# returned to the processor during a read cycle with 
a parity error detected on the data bus, PCHK# is asserted for 
one clock, one clock edge after BRDY# is sampled asserted. 




PWT (Page Writethrough) 



Output 

Summary The processor drives PWT to indicate the operating system's
specification of the writeback state or writethrough state for
the page being addressed. PWT, together with WB/WT#,
specifies the data cache-line state during cacheable read misses
and write hits to shared cache lines. See "WB/WT# (Writeback
or Writethrough)" on page 183 for more details.

The state of PWT depends upon the processor's operating mode 
and the state of certain bits in its control registers and TLB as 
follows: 

■ In Real mode, or in Protected and Virtual-8086 modes while 
paging is disabled (PG bit in CR0 set to 0):

PWT output = 0 (writeback state)

■ In Protected and Virtual-8086 modes while paging is
enabled (PG bit in CR0 set to 1):

• For accesses to I/O space, page directory entries, and 
other non-paged accesses: 

PWT output = PWT bit in CR3 

• For accesses to 4-Kbyte page table entries or 4-Mbyte 
pages: 

PWT output = PWT bit in page directory entry 

• For accesses to 4-Kbyte pages: 

PWT output = PWT bit in page table entry 

Driven and Floated PWT is driven off the same clock edge as ADS# and remains in 
the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. 

PWT is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD. 
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RESET (Reset) 



Input 

Summary When the processor samples RESET asserted, it immediately 

flushes and initializes all internal resources and its internal 
state including its pipelines and caches, the floating-point 
state, the MMX state, the 3D state, and all registers, and then 
the processor jumps to address FFFF_FFF0h to start
instruction execution. 



The signals BRDYC# and FLUSH# are sampled during the 
falling transition of RESET to select the drive strength of 
selected output signals and to invoke the Tri-State Test mode, 
respectively. See these signal descriptions for more details. 

Sampled RESET is sampled as a level-sensitive input on every clock 

edge. System logic can drive the signal either synchronously or 
asynchronously. 

During the initial power-on reset of the processor, RESET must
remain asserted for a minimum of 1.0 ms after CLK and VCC
reach specification before it is negated.

During a warm reset, while CLK and VCC are within their
specification, RESET must remain asserted for a minimum of
15 clocks prior to its negation.
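
As a worked example of the warm-reset requirement, the following Python sketch converts the 15-clock minimum into time for a given bus clock frequency. The function name and the constant names are hypothetical; the bus frequency is supplied by the caller and is not taken from this section.

```python
WARM_RESET_MIN_CLOCKS = 15     # from the warm-reset requirement above
POWER_ON_RESET_MIN_MS = 1.0    # from the power-on requirement above

def warm_reset_min_ns(clk_mhz):
    """Minimum warm-reset assertion time in nanoseconds.

    One clock at f MHz lasts 1000 / f nanoseconds.
    """
    return WARM_RESET_MIN_CLOCKS * 1000.0 / clk_mhz
```

With a 100-MHz bus clock, the 15-clock minimum corresponds to 150 ns, far shorter than the 1.0 ms required at power-on.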
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RSVD (Reserved) 



Summary Reserved signals are a special class of pins that can be treated

in one of the following ways: 

■ As no-connect (NC) pins, in which case these pins are left 
unconnected 

■ As pins connected to the system logic as defined by the 
industry-standard Pentium interface (Socket 7) 

■ Any combination of NC and Socket 7 pins 

In any case, if the RSVD pins are treated accordingly, the 
normal operation of the AMD-K6 3D processor is not adversely 
affected in any manner. 

See "Pin Designations" on page 342 for a list of the locations of 
the RSVD pins. 

SCYC (Split Cycle)

Output

Summary The processor asserts SCYC during misaligned, locked transfers

on the D[63:0] data bus. The processor generates additional bus 
cycles to complete the transfer of misaligned data. 

For purposes of bus cycles, the term aligned means: 

■ Any 1-byte transfers 

■ 2-byte and 4-byte transfers that lie within 4-byte address 
boundaries 

■ 8-byte transfers that lie within 8-byte address boundaries 
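
The definition of aligned given in the list above can be expressed as a small Python predicate. This is a sketch under the assumption that a transfer is described by its starting byte address and its size in bytes; the function name is_aligned is hypothetical.

```python
def is_aligned(addr, size):
    """Return True if a transfer is aligned for purposes of bus cycles."""
    if size == 1:
        return True                                    # any 1-byte transfer
    if size in (2, 4):
        # First and last byte must fall in the same 4-byte block.
        return addr // 4 == (addr + size - 1) // 4
    if size == 8:
        # First and last byte must fall in the same 8-byte block.
        return addr // 8 == (addr + size - 1) // 8
    raise ValueError("unsupported transfer size")
```

Under this definition a 2-byte transfer at offset 1 within a 4-byte block is aligned, while one at offset 3 is not, because it crosses the 4-byte boundary.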

Driven and Floated SCYC is asserted off the same clock edge as ADS#, and negated 
off the clock edge on which NA# or the last expected BRDY# of 
the entire locked sequence is sampled asserted. SCYC is only 
valid during locked memory cycles. 

SCYC is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD. 

SMI# (System Management Interrupt) 



Input, Internal Pullup

Summary The assertion of SMI# causes the processor to enter System
Management Mode (SMM). Upon recognizing SMI#, the
processor performs the following actions, in the order shown:

1. Flushes its instruction pipelines 

2. Completes all pending and in-progress bus cycles 

3. Acknowledges the interrupt by asserting SMIACT# after 
sampling EWBE# asserted 

4. Saves the internal processor state in SMM memory 

5. Disables interrupts by clearing the interrupt flag (IF) in 
EFLAGS and disables NMI interrupts 

6. Jumps to the entry point of the SMM service routine at the 
SMM base physical address which defaults to 0003_8000h in 
SMM memory 

See Chapter 3, "System Management Mode (SMM)" on page 
257 for more details regarding SMM. 

Sampled SMI# is sampled and latched as a falling edge-sensitive signal.

SMI# is sampled on every clock edge but is not recognized until 
the next instruction boundary. If SMI# is to be recognized on 
the instruction boundary associated with a BRDY#, it must be 
sampled asserted a minimum of three clock edges before the 
BRDY# is sampled asserted. If it is asserted synchronously, it 
can be asserted for a minimum of one clock. If it is asserted 
asynchronously, it must have been negated for a minimum of 
two clocks followed by an assertion of a minimum of two clocks. 

A second assertion of SMI# while in SMM is latched but is not 
recognized until the SMM service routine is exited. 




SMIACT# (System Management Interrupt Active) 



Output 

Summary The processor acknowledges the assertion of SMI# with the 

assertion of SMIACT# to indicate that the processor has 
entered System Management Mode (SMM). The system logic 
can use SMIACT# to enable SMM memory. See "SMI# (System 
Management Interrupt)" on page 176 for more details. 

See Chapter 3, "System Management Mode (SMM)" on page
257 for more details regarding SMM. 

Driven The processor asserts SMIACT# after the last BRDY# of the last 

pending bus cycle is sampled asserted (including all pending 
write cycles) and after EWBE# is sampled asserted. SMIACT# 
remains asserted until after the last BRDY# of the last pending
bus cycle associated with exiting SMM is sampled asserted. 

SMIACT# remains asserted during any flush, internal snoop, or 
writeback cycle due to an inquire cycle. 

STPCLK# (Stop Clock) 

Input, Internal Pullup

Summary The assertion of STPCLK# causes the processor to enter the

Stop Grant state, during which the processor's internal clock is 
stopped. From the Stop Grant state, the processor can 
subsequently transition to the Stop Clock state, in which the 
bus clock CLK is stopped. Upon recognizing STPCLK#, the 
processor performs the following actions, in the order shown: 

1. Flushes its instruction pipelines 

2. Completes all pending and in-progress bus cycles 

3. Acknowledges the STPCLK# assertion by executing a Stop 
Grant special bus cycle (see Table 44 on page 186) 

4. Stops its internal clock after BRDY# of the Stop Grant 
special bus cycle is sampled asserted and after EWBE# is
sampled asserted 

5. Enters the Stop Clock state if the system logic stops the bus 
clock CLK (optional) 

See Chapter 12, "Clock Control" on page 291 for more details 
regarding clock control. 

Sampled STPCLK# is sampled as a level-sensitive input on every clock 

edge but is not recognized until the next instruction boundary.
System logic can drive the signal either synchronously or 
asynchronously. If it is asserted asynchronously, it must be 
asserted for a minimum pulse width of two clocks. 

STPCLK# must remain asserted until recognized, which is 
indicated by the completion of the Stop Grant special cycle. 
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TCK (Test Clock) 

Input, Internal Pullup

Summary TCK is the clock for boundary-scan testing using the Test

Access Port (TAP). See "Boundary-Scan Test Access Port 
(TAP)" on page 271 for details regarding the operation of the 
TAP controller. 

Sampled The processor always samples TCK, except while TRST# is 

asserted. 



TDI (Test Data Input) 



Input, Internal Pullup

Summary TDI is the serial test data and instruction input for
boundary-scan testing using the Test Access Port (TAP). See
"Boundary-Scan Test Access Port (TAP)" on page 271 for details
regarding the operation of the TAP controller.

Sampled The processor samples TDI on every rising TCK edge but only
while in the Shift-IR and Shift-DR states.



TDO (Test Data Output) 



Output 

Summary TDO is the serial test data and instruction output for 

boundary-scan testing using the Test Access Port (TAP). See 
"Boundary-Scan Test Access Port (TAP)" on page 271 for details 
regarding the operation of the TAP controller. 

Driven and Floated The processor drives TDO on every falling TCK edge but only
while in the Shift-IR and Shift-DR states. TDO is floated at all
other times. 
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TMS (Test Mode Select) 



Input, Internal Pullup 

Summary TMS specifies the test function and sequence of state changes 

for boundary-scan testing using the Test Access Port (TAP). See 
"Boundary-Scan Test Access Port (TAP)" on page 271 for details 
regarding the operation of the TAP controller. 

Sampled The processor samples TMS on every rising TCK edge. If TMS is 

sampled High for five or more consecutive clocks, the TAP 
controller enters its Test-Logic-Reset state, regardless of the 
controller state. This action is the same as that achieved by 
asserting TRST#. 
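
The five-clock property of TMS can be checked with a small Python model of the TAP controller. The transition table below follows the standard IEEE 1149.1 TAP state diagram, which is assumed here; this section itself states only the five-clock behavior. The names TAP and clock_tms are hypothetical.

```python
# state: (next state if TMS = 0, next state if TMS = 1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def clock_tms(state, tms_bits):
    """Apply one TMS value per rising TCK edge and return the final state."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state
```

Starting from any of the sixteen states, clock_tms(state, [1] * 5) ends in Test-Logic-Reset, whereas four clocks are not always enough (for example, from Shift-DR).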



TRST# (Test Reset) 



Input, Internal Pullup

Summary The assertion of TRST# initializes the Test Access Port (TAP) by
resetting its state machine to the Test-Logic-Reset state. See
"Boundary-Scan Test Access Port (TAP)" on page 271 for details
regarding the operation of the TAP controller. 

Sampled TRST# is a completely asynchronous input that does not 

require a minimum setup and hold time relative to TCK. See 
Table 80 on page 326 for the minimum pulse width requirement. 



VCC2DET (Vcc2 Detect) 



Output 

Summary VCC2DET is internally tied to VSS (logic level 0) to indicate to
the system logic that it must supply the specified dual-voltage
requirements to the VCC2 and VCC3 pins. The VCC2 pins supply
voltage to the processor core, independent of the voltage
supplied to the I/O buffers on the VCC3 pins. Upon sampling
VCC2DET Low, system logic should sample VCC2H/L# to
identify core voltage requirements.

Driven VCC2DET always equals 0 and is never floated, even during

the Tri-State Test mode. 



VCC2H/L# (VCC2 High/Low)

Output

Summary VCC2H/L# is internally tied to VSS (logic level 0) to indicate to
the system logic that it must supply the specified processor core
voltage to the VCC2 pins. The VCC2 pins supply voltage to the
processor core, independent of the voltage supplied to the I/O
buffers on the VCC3 pins. Upon sampling VCC2DET Low to
identify dual-voltage processor requirements, system logic
should sample VCC2H/L# to identify the core voltage
requirements for 2.9V and 3.2V products (High) and 2.2V
products (Low).

Driven VCC2H/L# always equals 0 and is never floated for 2.2V
products, even during the Tri-State Test mode. To ensure
proper operation for 2.9V and 3.2V products, system logic that
samples VCC2H/L# should provide a weak pullup resistor for
this signal.



Table 38. Output Pin Float Conditions

Name        Floated At       Note
VCC2DET     Always Driven    *
VCC2H/L#    Always Driven    *

* All outputs except VCC2DET, VCC2H/L#, and TDO float
during the Tri-State Test mode.
W/R# (Write/Read)



Output 

Summary The processor drives W/R# to indicate whether it is performing 

a write or a read cycle on the bus. In addition, W/R# is used to 
define other bus cycles, including interrupt acknowledge and 
special cycles. See Table 44 on page 186 for more details. 

Driven and Floated W/R# is driven off the same clock edge as ADS# and remains in 
the same state until the clock edge on which NA# or the last 
expected BRDY# of the cycle is sampled asserted. W/R# is 
driven during memory cycles, I/O cycles, special bus cycles, and 
interrupt acknowledge cycles. 

W/R# is floated off the clock edge that BOFF# is sampled 
asserted and off the clock edge that the processor asserts HLDA 
in response to HOLD. 




WB/WT# (Writeback or Writethrough)

Input 

Summary WB/WT#, together with PWT, specifies the data cache-line state
during cacheable read misses and write hits to shared cache
lines.

If WB/WT# = 0 or PWT = 1 during a cacheable read miss or write
hit to a shared cache line, the accessed line is cached in the
shared state. This is referred to as the writethrough state
because all write cycles to this cache line are driven externally
on the bus.

If WB/WT# = 1 and PWT = 0 during a cacheable read miss or a
write hit to a shared cache line, the accessed line is cached in 
the exclusive state. Subsequent write hits to the same line 
cause its state to transition from exclusive to modified. This is 
referred to as the writeback state because the data cache can 
contain modified cache lines that are subject to be written 
back — referred to as a writeback cycle — as the result of an 
inquire cycle, an internal snoop, a flush operation, or the 
WBINVD instruction. 
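
The two cases above reduce to a single condition, sketched here in Python. The function name cached_line_state and the string state names are hypothetical; the inputs are the logic levels sampled on WB/WT# and driven on PWT.

```python
def cached_line_state(wb_wt, pwt):
    """State of the line after a cacheable read miss or a write hit
    to a shared line, given the WB/WT# and PWT levels (0 or 1)."""
    if wb_wt == 1 and pwt == 0:
        return "exclusive"   # writeback state
    return "shared"          # writethrough state
```

Only the combination WB/WT# = 1 with PWT = 0 yields the exclusive (writeback) state; the other three combinations yield the shared (writethrough) state.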

Sampled WB/WT# is sampled on the clock edge that the first BRDY# or

NA# of a bus cycle is sampled asserted. If the cycle is a burst 
read, WB/WT# is ignored during the last three assertions of 
BRDY#. WB/WT# is sampled during memory read and 
non-writeback write cycles and is ignored during all other types 
of cycles. 







Table 39. Input Pin Types

Name      Type          Note    Name      Type          Note
A20M#     Asynchronous  1       IGNNE#    Asynchronous  1
AHOLD     Synchronous           INIT      Asynchronous  2
BF[2:0]   Synchronous   4       INTR      Asynchronous  1
BOFF#     Synchronous           INV       Synchronous
BRDY#     Synchronous           KEN#      Synchronous
BRDYC#    Synchronous   7       NA#       Synchronous
CLK       Clock                 NMI       Asynchronous  2
EADS#     Synchronous           RESET     Asynchronous  5, 6
EWBE#     Synchronous           SMI#      Asynchronous  2
FLUSH#    Asynchronous  2, 3    STPCLK#   Asynchronous  1
HOLD      Synchronous           WB/WT#    Synchronous

Notes:

1. These level-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and
hold times must be met. If asserted asynchronously, they must be asserted for a minimum pulse width of two clocks.

2. These edge-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and
hold times must be met. If asserted asynchronously, they must have been negated at least two clocks prior to assertion and must
remain asserted at least two clocks.

3. FLUSH# is also sampled during the falling transition of RESET and can be asserted synchronously or asynchronously. To be
sampled on a specific clock edge, setup and hold times must be met relative to the clock edge before the clock edge on which
RESET is sampled negated. If asserted asynchronously, FLUSH# must meet a minimum setup and hold time of two clocks relative
to the negation of RESET.

4. BF[2:0] are sampled during the falling transition of RESET. They must meet a minimum setup time of 1.0 ms and a minimum hold
time of two clocks relative to the negation of RESET.

5. During the initial power-on reset of the processor, RESET must remain asserted for a minimum of 1.0 ms after CLK and VCC reach
specification before it is negated.

6. During a warm reset, while CLK and VCC are within their specification, RESET must remain asserted for a minimum of 15 clocks
prior to its negation.

7. BRDYC# is also sampled during the falling transition of RESET. If RESET is driven synchronously, BRDYC# must meet the specified
hold time relative to the negation of RESET. If asserted asynchronously, BRDYC# must meet a minimum setup and hold time of
two clocks relative to the negation of RESET.




Table 40. Output Pin Float Conditions

Name      Floated At: (Note 1)  Note    Name      Floated At: (Note 1)  Note
A[4:3]    HLDA, AHOLD, BOFF#    2, 3    HLDA      Always Driven
ADS#      HLDA, BOFF#           2       LOCK#     HLDA, BOFF#           2
ADSC#     HLDA, BOFF#           2       M/IO#     HLDA, BOFF#           2
APCHK#    Always Driven                 PCD       HLDA, BOFF#           2
BE[7:0]#  HLDA, BOFF#           2       PCHK#     Always Driven
BREQ      Always Driven                 PWT       HLDA, BOFF#           2
CACHE#    HLDA, BOFF#           2       SCYC      HLDA, BOFF#           2
D/C#      HLDA, BOFF#           2       SMIACT#   Always Driven
FERR#     Always Driven                 VCC2DET   Always Driven
HIT#      Always Driven                 VCC2H/L#  Always Driven
HITM#     Always Driven                 W/R#      HLDA, BOFF#           2

Notes:

1. All outputs except VCC2DET, VCC2H/L#, and TDO float during the Tri-State Test mode.

2. Floated off the clock edge that BOFF# is sampled asserted and off the clock edge that HLDA is asserted.

3. Floated off the clock edge that AHOLD is sampled asserted.



Table 41. Input/Output Pin Float Conditions

Name      Floated At: (Note 1)  Note
A[31:5]   HLDA, AHOLD, BOFF#    2, 3
AP        HLDA, AHOLD, BOFF#    2, 3
D[63:0]   HLDA, BOFF#           2
DP[7:0]   HLDA, BOFF#           2

Notes:

1. All outputs except VCC2DET and TDO float during the Tri-State Test mode.

2. Floated off the clock edge that BOFF# is sampled asserted and off the clock edge that HLDA is asserted.

3. Floated off the clock edge that AHOLD is sampled asserted.



Table 42. Test Pins

Name    Type    Note
TCK     Clock
TDI     Input   Sampled on the rising edge of TCK
TDO     Output  Driven on the falling edge of TCK
TMS     Input   Sampled on the rising edge of TCK
TRST#   Input   Asynchronous (Independent of TCK)



Table 43. Bus Cycle Definition

                                          Generated by the Processor    Generated by the System
Bus Cycle Initiated                       M/IO#  D/C#  W/R#  CACHE#     KEN#
Code Read, Instruction Cache Line Fill    1      0     0     0          0
Code Read, Noncacheable                   1      0     0     1          X
Code Read, Noncacheable                   1      0     0     X          1
Encoding for Special Cycle                0      0     1     1          X
Interrupt Acknowledge                     0      0     0     1          X
I/O Read                                  0      1     0     1          X
I/O Write                                 0      1     1     1          X
Memory Read, Data Cache Line Fill         1      1     0     0          0
Memory Read, Noncacheable                 1      1     0     1          X
Memory Read, Noncacheable                 1      1     0     X          1
Memory Write, Data Cache Writeback        1      1     1     0          X
Memory Write, Noncacheable                1      1     1     1          X

Note:

X means "don't care"
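
The bus cycle definitions of Table 43 can be turned into a lookup, sketched below in Python. This is an illustration only. Several cells of the table did not survive extraction; the sketch assumes those cells are 1, which matches the Pentium-compatible bus cycle encoding but should be checked against the original table. The names BUS_CYCLES and decode_bus_cycle are hypothetical.

```python
X = None  # don't care

# (M/IO#, D/C#, W/R#, CACHE#, KEN#) -> bus cycle name
# Values not legible in the extracted table are assumed to be 1.
BUS_CYCLES = [
    ((1, 0, 0, 0, 0), "Code Read, Instruction Cache Line Fill"),
    ((1, 0, 0, 1, X), "Code Read, Noncacheable"),
    ((1, 0, 0, X, 1), "Code Read, Noncacheable"),
    ((0, 0, 1, 1, X), "Special Cycle"),
    ((0, 0, 0, 1, X), "Interrupt Acknowledge"),
    ((0, 1, 0, 1, X), "I/O Read"),
    ((0, 1, 1, 1, X), "I/O Write"),
    ((1, 1, 0, 0, 0), "Memory Read, Data Cache Line Fill"),
    ((1, 1, 0, 1, X), "Memory Read, Noncacheable"),
    ((1, 1, 0, X, 1), "Memory Read, Noncacheable"),
    ((1, 1, 1, 0, X), "Memory Write, Data Cache Writeback"),
    ((1, 1, 1, 1, X), "Memory Write, Noncacheable"),
]

def decode_bus_cycle(mio, dc, wr, cache, ken):
    """Return the name of the first matching row, or None if no row matches."""
    pins = (mio, dc, wr, cache, ken)
    for pattern, name in BUS_CYCLES:
        if all(p is X or p == v for p, v in zip(pattern, pins)):
            return name
    return None
```

Note that a read with CACHE# = 0 becomes a cache line fill only if KEN# is also sampled 0; otherwise the lookup reports a noncacheable read.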



Table 44. Special Cycles

Special Cycle                    A4  BE7#  BE6#  BE5#  BE4#  BE3#  BE2#  BE1#  BE0#  M/IO#  D/C#  W/R#  CACHE#  KEN#
Stop Grant                       1   1     1     1     1     1     0     1     1     0      0     1     1       X
Flush Acknowledge                0   1     1     1     0     1     1     1     1     0      0     1     1       X
(FLUSH# sampled asserted)
Writeback                        0   1     1     1     1     0     1     1     1     0      0     1     1       X
(WBINVD instruction)
Halt                             0   1     1     1     1     1     0     1     1     0      0     1     1       X
Flush (INVD, WBINVD              0   1     1     1     1     1     1     0     1     0      0     1     1       X
instruction)
Shutdown                         0   1     1     1     1     1     1     1     0     0      0     1     1       X

Note:

X means "don't care"
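
A special cycle can be identified from A4 and the byte enables, as the following Python sketch shows. It is an illustration under stated assumptions: the A4 column heading and the BE7# through BE5# values were not legible in the extracted table and are assumed here (BE7# through BE5# = 1), consistent with the Pentium-compatible encoding. The names SPECIAL_CYCLES and decode_special_cycle are hypothetical, and the caller is expected to have already matched the special-cycle encoding on M/IO#, D/C#, W/R#, and CACHE#.

```python
# (A4, BE[7:0]# packed as one byte with BE7# as bit 7) -> special cycle
SPECIAL_CYCLES = {
    (1, 0xFB): "Stop Grant",
    (0, 0xEF): "Flush Acknowledge",
    (0, 0xF7): "Writeback",
    (0, 0xFB): "Halt",
    (0, 0xFD): "Flush",
    (0, 0xFE): "Shutdown",
}

def decode_special_cycle(a4, be_n):
    """Return the special cycle name, or None for an unlisted encoding."""
    return SPECIAL_CYCLES.get((a4, be_n & 0xFF))
```

Stop Grant and Halt share the same byte-enable pattern and are distinguished only by A4.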
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6 

Bus Cycles 



The following sections describe and illustrate the timing and 
relationship of bus signals during various types of bus cycles. A 
representative set of bus cycles is illustrated. 

Timing Diagrams 



The timing diagrams illustrate the signals on the external local 
bus as a function of time, as measured by the bus clock (CLK). 
Throughout this chapter, the term clock refers to a single 
bus-clock cycle. A clock extends from one rising CLK edge to 
the next rising CLK edge. The processor samples and drives
most signals relative to the rising edge of CLK. The exceptions 
to this rule include the following: 

■ BF[2:0]— Sampled on the falling edge of RESET 

■ FLUSH#, BRDYC#— Sampled on the falling edge of RESET, 
also sampled on the rising edge of CLK 

■ All inputs and outputs are sampled relative to TCK in 
Boundary-Scan Test Mode. Inputs are sampled on the rising 
edge of TCK, outputs are driven off of the falling edge of 
TCK. 

For each signal in the timing diagrams, the High level 
represents 1, the Low level represents 0, and the Middle level 
represents the floating (high-impedance) state. When both the 
High and Low levels are shown, the meaning depends on the
signal. A single signal indicates 'don't care'. In the case of bus 
activity, if both High and Low levels are shown, it indicates the 
processor, alternate master, or system logic is driving a value, 
but this value may or may not be valid. (For example, the value 
on the address bus is valid only during the assertion of ADS#, 
but addresses are also driven on the bus at other times.) Figure 
52 defines the different waveform representations. 



Waveform    Description

/           Signal or bus is changing from Low to High
\           Signal or bus is changing from High to Low
(symbol)    Bus is changing
(symbol)    Bus is changing from valid to invalid
(symbol)    Signal or bus is floating
(symbol)    Denotes multiple clock periods

Figure 52. Waveform Definitions

For all active-High signals, the term asserted means the signal is 
in the High-voltage state and the term negated means the signal 
is in the Low-voltage state. For all active-Low signals, the term 
asserted means the signal is in the Low-voltage state and the 
term negated means the signal is in the High-voltage state. 
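The asserted/negated convention above can be sketched in a few lines. This is only an illustration: the helper name `is_asserted` is hypothetical, and it assumes that a trailing `#` marks a single-function active-Low signal (dual-function pins such as W/R# or M/IO# select between two meanings and are not covered).

```python
def is_asserted(signal_name: str, level_high: bool) -> bool:
    """Return True when the named signal is asserted.

    Assumption: a name ending in '#' denotes an active-Low signal
    (for example ADS# or BRDY#); any other name is active-High.
    """
    active_low = signal_name.endswith("#")
    # Active-Low signals are asserted at the Low level, active-High at the High level.
    return (not level_high) if active_low else level_high


if __name__ == "__main__":
    print(is_asserted("ADS#", level_high=False))   # active-Low signal driven Low
    print(is_asserted("HOLD", level_high=True))    # active-High signal driven High
```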

See "Signal Terminology" on page 139 for definitions of terms 
related to signals. 
Bus State Machine Diagram




Idle 



The processor does not drive the system bus in the Idle state 
and remains in this state until a new bus cycle is requested. The 
processor enters this state off the clock edge on which the last 
BRDY# of a cycle is sampled asserted during the following 
conditions: 

■ The processor is in the Data state 

■ The processor is in the Data-NA# Requested state and no 
internal pending cycle is requested 

In addition, the processor is forced into this state when the 
system logic asserts RESET or BOFF#. The transition to this 
state occurs on the clock edge on which RESET or BOFF# is 
sampled asserted. 



Address

In this state, the processor drives ADS# to indicate the
beginning of a new bus cycle by validating the address and
control signals. The processor remains in this state for one clock
and unconditionally enters the Data state on the next clock
edge.

Data

In the Data state, the processor drives the data bus during a
write cycle or expects data to be returned during a read cycle.
The processor remains in this state until either NA# or the last
BRDY# is sampled asserted. If the last BRDY# is sampled
asserted or both the last BRDY# and NA# are sampled asserted
on the same clock edge, the processor enters the Idle state. If
NA# is sampled asserted first, the processor enters the
Data-NA# Requested state.

Data-NA# Requested

If the processor samples NA# asserted while in the Data state
and the current bus cycle is not completed (the last BRDY# is
not sampled asserted), it enters the Data-NA# Requested state.
The processor remains in this state until either the last BRDY#
is sampled asserted or an internal pending cycle is requested. If
the last BRDY# is sampled asserted before the processor drives
a new bus cycle, the processor enters the Idle state (no internal
pending cycle is requested) or the Address state (the processor
has an internal pending cycle).

Pipeline Address 

In this state, the processor drives ADS# to indicate the 
beginning of a new bus cycle by validating the address and 
control signals. In this state, the processor is still waiting for the 
current bus cycle to be completed (until the last BRDY# is 
sampled asserted). If the last BRDY# is not sampled asserted, 
the processor enters the Pipeline Data state. 

If the processor samples the last BRDY# asserted in this state, it 
determines if a bus transition is required between the current 
bus cycle and the pipelined bus cycle. A bus transition is 
required when the data bus direction changes between bus 
cycles, such as a memory write cycle followed by a memory read 
cycle. If a bus transition is required, the processor enters the 
Transition state for one clock to prevent data bus contention. If 
a bus transition is not required, the processor enters the Data 
state.

The processor does not transition to the Data-NA# Requested 
state from the Pipeline Address state because the processor 
does not begin sampling NA# until it has exited the Pipeline 
Address state. 

Pipeline Data 

Two bus cycles are concurrently executing in this state. The 
processor cannot issue any additional bus cycles until the 
current bus cycle is completed. The processor drives the data 
bus during write cycles or expects data to be returned during 
read cycles for the current bus cycle until the last BRDY# of the 
current bus cycle is sampled asserted. 

If the processor samples the last BRDY# asserted in this state, it 
determines if a bus transition is required between the current 
bus cycle and the pipelined bus cycle. If the bus transition is 
required, the processor enters the Transition state for one clock
to prevent data bus contention. If a bus transition is not
required, the processor enters the Data state (NA# was not 
sampled asserted) or the Data-NA# Requested state (NA# was 
sampled asserted). 

Transition

The processor enters this state for one clock during data bus 
transitions and enters the Data state on the next clock edge if 
NA# is not sampled asserted. The sole purpose of this state is to 
avoid bus contention caused by bus transitions during pipeline 
operation. 
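The state descriptions above can be read as a transition function evaluated once per rising CLK edge. The following is a minimal sketch of that reading, not the processor's actual logic. The names `BusState` and `next_state` and the argument names are hypothetical. Two transitions are assumptions because the text does not state them outright: that a pending cycle requested in the Data-NA# Requested state is driven from the Pipeline Address state, and that the Transition state moves to Data-NA# Requested when NA# is sampled asserted.

```python
from enum import Enum, auto


class BusState(Enum):
    IDLE = auto()
    ADDRESS = auto()
    DATA = auto()
    DATA_NA = auto()        # Data-NA# Requested
    PIPE_ADDRESS = auto()   # Pipeline Address
    PIPE_DATA = auto()      # Pipeline Data
    TRANSITION = auto()


def next_state(state, *, pending=False, last_brdy=False, na=False,
               turnaround=False, reset_or_boff=False):
    """Return the bus state after one rising CLK edge.

    pending       -- an internal bus cycle is waiting to be driven
    last_brdy     -- the last BRDY# of the current cycle is sampled asserted
    na            -- NA# is sampled asserted
    turnaround    -- the data bus changes direction between the two cycles
    reset_or_boff -- RESET or BOFF# is sampled asserted
    """
    if reset_or_boff:
        return BusState.IDLE                      # forced to Idle from any state
    if state is BusState.IDLE:
        return BusState.ADDRESS if pending else BusState.IDLE
    if state is BusState.ADDRESS:
        return BusState.DATA                      # unconditional after one clock
    if state is BusState.DATA:
        if last_brdy:
            return BusState.IDLE                  # also when NA# is asserted on the same edge
        return BusState.DATA_NA if na else BusState.DATA
    if state is BusState.DATA_NA:
        if last_brdy:
            return BusState.ADDRESS if pending else BusState.IDLE
        # Assumption: the pending cycle is driven from the Pipeline Address state.
        return BusState.PIPE_ADDRESS if pending else BusState.DATA_NA
    if state is BusState.PIPE_ADDRESS:
        if not last_brdy:
            return BusState.PIPE_DATA
        return BusState.TRANSITION if turnaround else BusState.DATA
    if state is BusState.PIPE_DATA:
        if not last_brdy:
            return BusState.PIPE_DATA
        if turnaround:
            return BusState.TRANSITION
        return BusState.DATA_NA if na else BusState.DATA
    # TRANSITION lasts one clock.
    # Assumption: NA# sampled asserted leads to Data-NA# Requested.
    return BusState.DATA_NA if na else BusState.DATA
```

Stepping the function through a write followed by a pipelined read (with `turnaround=True`) shows the single Transition clock that separates the two data phases.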

Memory Reads and Writes 



The processor performs single or burst memory bus cycles. The 
single-transfer memory bus cycle transfers 1, 2, 4, or 8 bytes and 
requires a minimum of two clocks. Misaligned instructions or 
operands result in a split cycle, which requires multiple 
transactions on the bus. A burst cycle consists of four 
back-to-back 8-byte (64-bit) transfers on the data bus. 

Single-Transfer Memory Read and Write

Figure 54 on page 193 shows a single-transfer read from memory,
followed by two single-transfer writes to memory. For the 
memory read cycle, the processor asserts ADS# for one clock to 
validate the bus cycle and also drives A[31:3], BE[7:0]#, D/C#, 
W/R#, and M/IO# to the bus. The processor then waits for the 
system logic to return the data on D[63:0] (with DP[7:0] for
parity checking) and assert BRDY#. The processor samples 
BRDY# on every clock edge starting with the clock edge after 
the clock edge that negates ADS#. See "BRDY# (Burst Ready)" 
on page 149. 

During the read cycle, the processor drives PCD, PWT, and 
CACHE# to indicate its caching and cache-coherency intent for 
the access. The system logic returns KEN# and WB/WT# to 
either confirm or change this intent. If the processor asserts 
PCD and negates CACHE#, the accesses are noncacheable, even 
though the system logic asserts KEN# during the BRDY# to 
indicate its support for cacheability. The processor (which 
drives CACHE#) and the system logic (which drives KEN#) must 
agree in order for an access to be cacheable. 
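The agreement rule can be expressed as a one-line check on the two pin levels. This is a sketch for illustration only: the function name `read_is_cacheable` is hypothetical, and the arguments are the sampled pin levels (0 = Low, 1 = High) of the active-Low signals CACHE# and KEN#.

```python
def read_is_cacheable(cache_n: int, ken_n: int) -> bool:
    """Return True when a read is treated as a cacheable line fill.

    cache_n -- level of CACHE# driven by the processor (0 = asserted)
    ken_n   -- level of KEN# returned by the system logic (0 = asserted)
    """
    # Both sides must assert their (active-Low) signal for the read to be cacheable.
    return cache_n == 0 and ken_n == 0


if __name__ == "__main__":
    # The system logic asserts KEN#, but the processor negated CACHE#.
    print(read_is_cacheable(cache_n=1, ken_n=0))
```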
The processor can drive another cycle (in this example, a write
cycle) by asserting ADS# off the next clock edge after BRDY# is 
sampled asserted. Therefore, an idle clock is guaranteed 
between any two bus cycles. The processor drives D[63:0] with 
valid data one clock edge after the clock edge on which ADS# is 
asserted. To minimize processor idle times, the system logic 
stores the address and data in write buffers, returns BRDY#, and 
performs the store to memory later. If the processor samples 
EWBE# negated during a write cycle, it suspends certain 
activities until EWBE# is sampled asserted. See "EWBE# 
(External Write Buffer Empty)" on page 157. In Figure 54, the
second write cycle occurs during the execution of a serializing 
instruction. The processor delays the following cycle until 
EWBE# is sampled asserted.



[Timing diagram: Read Cycle (ADDR, DATA, IDLE); Write Cycle (ADDR, DATA, DATA, IDLE); Write Cycle with Next Cycle Delayed by EWBE# (ADDR, DATA, DATA, IDLE, IDLE, IDLE, IDLE, IDLE). Signals shown: CLK, A[31:3], BE[7:0]#, ADS#, M/IO#, D/C#, BREQ, D[63:0], DP[7:0], CACHE#, KEN#, BRDY#.]

Figure 54. Non-Pipelined Single-Transfer Memory Read/Write and Write Delayed by EWBE#
Misaligned Single-Transfer Memory Read and Write

Figure 55 on page 195 shows a misaligned (split) memory read 
followed by a misaligned memory write. Any cycle that is not 
aligned as defined in "SCYC (Split Cycle)" on page 175 is 
considered misaligned. When the processor encounters a 
misaligned access, it determines the appropriate pair of bus 
cycles — each with its own ADS# and BRDY# — required to 
complete the access. 

The processor performs misaligned memory reads and memory 
writes using least-significant bytes (LSBs) first followed by 
most-significant bytes (MSBs). Table 45 shows the order. In the 
first memory read cycle in Figure 55, the processor reads the 
least-significant bytes. Immediately after the processor 
samples BRDY# asserted, it drives the second bus cycle to read 
the most-significant bytes to complete the misaligned transfer. 



Table 45. Bus-Cycle Order During Misaligned Transfers

Type of Access    First Cycle    Second Cycle
Memory Read       LSBs           MSBs
Memory Write      LSBs           MSBs



Similarly, the misaligned memory write cycle in Figure 55 
transfers the LSBs to the memory bus first. In the next cycle, 
after the processor samples BRDY# asserted, the MSBs are 
written to the memory bus. 
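The LSBs-first ordering can be illustrated with a small helper that splits an access into bus cycles and derives the BE[7:0]# pattern for each. This is a simplified sketch, not the processor's alignment rule: the function name `split_access` is hypothetical, and it assumes a split happens only where the access crosses a quadword (8-byte) boundary. The actual definition of a misaligned access is the one given in "SCYC (Split Cycle)" and may also split accesses that stay within one quadword.

```python
def split_access(address: int, size: int):
    """Split an access into quadword-sized bus cycles, low bytes first.

    Returns a list of (quadword_address, byte_enables) tuples, where
    byte_enables is the BE[7:0]# pattern (a 0 bit enables that byte lane).
    """
    cycles = []
    end = address + size
    while address < end:
        base = address & ~0x7                    # quadword that holds this byte
        chunk_end = min(end, base + 8)           # stop at the quadword boundary
        lanes = 0
        for byte in range(address - base, chunk_end - base):
            lanes |= 1 << byte                   # mark each active byte lane
        cycles.append((base, ~lanes & 0xFF))     # invert: BE[7:0]# is active-Low
        address = chunk_end
    return cycles


if __name__ == "__main__":
    # A 4-byte access starting 2 bytes below a quadword boundary needs two cycles.
    for base, be in split_access(0x1006, 4):
        print(hex(base), format(be, "08b"))
```

The first tuple returned always covers the least-significant bytes, matching the order in Table 45.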
[Timing diagram: Memory Read (Misaligned): ADDR, DATA, DATA, IDLE, ADDR, DATA, DATA, IDLE. Memory Write (Misaligned): ADDR, DATA, DATA, DATA, IDLE, ADDR, DATA, DATA, DATA, IDLE.]

Figure 55. Misaligned Single-Transfer Memory Read and Write

Burst Reads and Pipelined Burst Reads 

Figure 56 on page 197 shows normal burst read cycles and a 
pipelined burst read cycle. The AMD-K6 3D processor drives 
CACHE# and ADS# together to specify that the current bus
cycle is a burst cycle. If the processor samples KEN# asserted 
with the first BRDY#, it performs burst transfers. During the 
burst transfers, the system logic must ignore BE[7:0]# and must 
return all eight bytes beginning at the starting address the 
processor asserts on A[31:3]. Depending on the starting 
address, the system logic must determine the successive 
quadword addresses (A[4:3]) for each transfer in a burst, as 
shown in Table 46 on page 196. The processor expects the 
second, third, and fourth quadwords to occur in the sequences 
shown in Table 46. 
Table 46. A[4:3] Address-Generation Sequence During Bursts

Address Driven By        A[4:3] Addresses of Subsequent
Processor on A[4:3]      Quadwords* Generated By System Logic

Quadword 1               Quadword 2    Quadword 3    Quadword 4
00b                      01b           10b           11b
01b                      00b           11b           10b
10b                      11b           00b           01b
11b                      10b           01b           00b

Note:

* quadword = 8 bytes
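Each row of Table 46 equals the starting A[4:3] value exclusive-ORed with 0, 1, 2, and 3 in turn. That is an observation about the table rather than a statement from the text, and the helper name `burst_sequence` below is hypothetical. A minimal sketch of how system logic could generate the sequence:

```python
def burst_sequence(first_a4_a3: int):
    """Return the A[4:3] values of the four quadwords in a burst, in order."""
    if not 0 <= first_a4_a3 <= 3:
        raise ValueError("A[4:3] is a two-bit field")
    # XOR with the transfer index reproduces each row of Table 46.
    return [first_a4_a3 ^ step for step in range(4)]


if __name__ == "__main__":
    for start in range(4):
        print([format(q, "02b") for q in burst_sequence(start)])
```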



In Figure 56 on page 197, the processor drives CACHE# 
throughout all burst read cycles. In the first burst read cycle, 
the processor drives ADS# and CACHE#, then samples BRDY# 
on every clock edge starting with the clock edge after the clock 
edge that negates ADS#. The processor samples KEN# asserted 
on the clock edge on which the first BRDY# is sampled asserted, 
executes a 32-byte burst read cycle, and expects a total of four 
BRDY# signals. An ideal no-wait state access is shown in Figure 
56, whereas most system logic solutions add wait states between 
the transfers. 

The second burst read cycle illustrates a similar sequence, but 
the processor samples NA# asserted on the same clock edge 
that the first BRDY# is sampled asserted. NA# assertion 
indicates the system logic is requesting the processor to output 
the next address early (also known as a pipeline transfer 
request). Without waiting for the current cycle to complete, the 
processor drives ADS# and related signals for the next burst 
cycle. Pipelining can reduce processor cycle-to-cycle idle times. 
[Timing diagram: Burst Read (ADDR, DATA, DATA, DATA, DATA, IDLE); Burst Read with NA# sampled asserted (ADDR, DATA, DATA, DATA-NA, PIPE-ADDR); Pipelined Burst Read (DATA, DATA, DATA, DATA, IDLE). Signals shown: CLK, A[31:3], BE[7:0]#, ADS#, M/IO#, D/C#, W/R#, NA#, D[63:0], CACHE#, KEN#, BRDY#.]

Figure 56. Burst Reads and Pipelined Burst Reads



Burst Writeback 



Figure 57 on page 198 shows a burst read followed by a 
writeback transaction. The processor initiates writebacks under 
the following conditions: 

■ Replacement — If a cache-line fill is initiated for a cache line 
currently filled with valid entries, the processor uses a 
least-recently-allocated (LRA) algorithm to select a line for 
replacement. Before a replacement is made to a data cache 
line that is in the modified state, the modified line is 
scheduled to be written back to memory. 

■ Internal Snoop — The processor snoops the data cache 
whenever an instruction-cache line is read, and it snoops the 
instruction cache whenever a data cache line is written. This 
snooping is performed to determine whether the same 
address is stored in both caches, a situation that is taken to 
imply the occurrence of self-modifying code. If a snoop hits a
data cache line in the modified state, the line is written back 
to memory before being invalidated. 
■ WBINVD Instruction — When the processor executes a
WBINVD instruction, it writes back all modified lines in the 
data cache and then invalidates all lines in both caches. 

■ Cache Flush — When the processor samples FLUSH# 
asserted, it executes a flush acknowledge special cycle and 
writes back all modified lines in the data cache and then 
invalidates all lines in both caches. 



The processor drives writeback cycles during inquire or cache 
flush cycles. The writeback shown in Figure 57 is caused by a 
cache-line replacement. The processor completes the burst read 
cycle that fills the cache line. Immediately following the burst 
read cycle is the burst writeback cycle that represents the 
modified line to be written back to memory. D[63:0] are driven 
one clock edge after the clock edge on which ADS# is asserted 
and are subsequently changed off the clock edge on which each 
of the four BRDY# signals of the burst cycle is sampled
asserted. 



[Timing diagram: Burst Read (ADDR, DATA, DATA, DATA, DATA, IDLE) followed by Burst Writeback from L1 Cache (ADDR, DATA, DATA, DATA, DATA, IDLE).]

Figure 57. Burst Writeback due to Cache-Line Replacement




I/O Read and Write 



Basic I/O Read and Write 



The AMD-K6 3D processor accesses I/O when it executes an I/O
instruction (for example, IN or OUT). Figure 58 on page 200 
shows an I/O read followed by an I/O write. The processor drives 
M/IO# Low and D/C# High during I/O cycles. In this example, 
the first cycle shows a single wait state I/O read cycle. It follows 
the same sequence as a single-transfer memory read cycle. The 
processor drives ADS# to initiate the bus cycle, then it samples 
BRDY# on every clock edge starting with the clock edge after 
the clock edge that negates ADS#. The system logic must return 
BRDY# to complete the cycle. When the processor samples 
BRDY# asserted, it can assert ADS# for the next cycle off the 
next clock edge. (In this example, an I/O write cycle.) 

The I/O write cycle is similar to a memory write cycle, but the 
processor drives M/IO# Low during an I/O write cycle. The
processor asserts ADS# to initiate the bus cycle. The processor
drives D[63:0] with valid data one clock edge after the clock
edge on which ADS# is asserted. The system logic must assert
BRDY# when the data is properly stored to the I/O destination. 
The processor samples BRDY# on every clock edge starting with 
the clock edge after the clock edge that negates ADS#. In this 
example, two wait states are inserted while the processor waits 
for BRDY# to be asserted. 
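The clock counts in this example follow from one address clock plus one data clock per transfer, with each wait state adding a data clock. The helper below is a sketch of that arithmetic; the name `bus_cycle_clocks` is hypothetical and idle clocks between cycles are not counted.

```python
def bus_cycle_clocks(wait_states):
    """Return the number of clocks in one bus cycle.

    wait_states -- sequence with one entry per transfer, giving the wait
                   states inserted before BRDY# is sampled asserted for it
                   (one entry for a single transfer, four for a burst).
    """
    if any(w < 0 for w in wait_states):
        raise ValueError("wait states cannot be negative")
    # One clock for the address phase (ADS# asserted), then one data clock
    # per transfer plus that transfer's wait states.
    return 1 + sum(1 + w for w in wait_states)


if __name__ == "__main__":
    print(bus_cycle_clocks([1]))           # I/O read with a single wait state
    print(bus_cycle_clocks([2]))           # I/O write with two wait states
    print(bus_cycle_clocks([0, 0, 0, 0]))  # ideal no-wait-state burst
```

A single transfer with no wait states gives the two-clock minimum stated for single-transfer memory cycles.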
[Timing diagram: I/O Read Cycle (ADDR, DATA, DATA, IDLE); I/O Write Cycle (ADDR, DATA, DATA, DATA, IDLE). Signals shown: CLK, A[31:3], BE[7:0]#, ADS#, M/IO#, D/C#, W/R#, D[63:0], BRDY#.]

Figure 58. Basic I/O Read and Write

Misaligned I/O Read and Write 




Table 47 shows the misaligned I/O read and write cycle order 
executed by the processor. In Figure 59 on page 201, the
least-significant bytes (LSBs) are transferred first. Immediately 
after the processor samples BRDY# asserted, it drives the 
second bus cycle to transfer the most-significant bytes (MSBs) 
to complete the misaligned bus cycle. 

Table 47. Bus-Cycle Order During Misaligned I/O Transfers

Type of Access    First Cycle    Second Cycle
I/O Read          LSBs           MSBs
I/O Write         LSBs           MSBs
[Timing diagram: Misaligned I/O Read (ADDR, DATA, DATA, IDLE, ADDR, DATA, DATA, IDLE), with BRDY# shown.]

Figure 59. Misaligned I/O Transfer



Inquire and Bus Arbitration Cycles 



The processor provides built-in level-one data and instruction 
caches. Each cache is 32 Kbytes and two-way set-associative. 
The system logic or other bus master devices can initiate an 
inquire cycle to maintain cache/memory coherency. In response 
to the inquire cycle, the processor compares the inquire address 
with its cache tag addresses in both caches, and, if necessary, 
updates the MESI state of the cache line and performs 
writebacks to memory. 

An inquire cycle can be initiated by asserting AHOLD, BOFF#, 
or HOLD. AHOLD is exclusively used to support inquire cycles. 
During AHOLD-initiated inquire cycles, the processor only 
floats the address bus. BOFF# provides the fastest access to the 
bus because it aborts any processor cycle that is in progress,
whereas AHOLD and HOLD both permit an in-progress bus 
cycle to complete. During HOLD-initiated and BOFF#-initiated 
inquire cycles, the processor floats all of its bus-driving signals. 






Hold and Hold Acknowledge Cycle 

The system logic or another bus device can assert HOLD to 
initiate an inquire cycle or to gain full control of the bus. When 
the processor samples HOLD asserted, it completes any 
in-progress bus cycle and asserts HLDA to acknowledge release 
of the bus. The processor floats the following signals off the 
same clock edge that HLDA is asserted: 



■ A[31:3]
■ ADS#
■ AP
■ BE[7:0]#
■ CACHE#
■ D[63:0]
■ D/C#
■ DP[7:0]
■ LOCK#
■ M/IO#
■ PCD
■ PWT
■ SCYC
■ W/R#



Figure 60 on page 203 shows a basic HOLD/HLDA operation. In 
this example, the processor samples HOLD asserted during the 
memory read cycle. It continues the current memory read cycle
until BRDY# is sampled asserted. The processor drives HLDA 
and floats its outputs one clock edge after the last BRDY# of the 
cycle is sampled asserted. The system logic can assert HOLD for 
as long as it needs to utilize the bus. The processor samples 
HOLD on every clock edge but does not assert HLDA until any 
in-progress cycle or sequence of locked cycles is completed. 

When the processor samples HOLD negated during a hold 
acknowledge cycle, it negates HLDA off the next clock edge. 
The processor regains control of the bus and can assert ADS# 
off the same clock edge on which HLDA is negated. 
Figure 60. Basic HOLD/HLDA Operation

HOLD-Initiated Inquire Hit to Shared or Exclusive Line

Figure 61 on page 204 shows a HOLD-initiated inquire cycle. In 
this example, the processor samples HOLD asserted during the 
burst memory read cycle. The processor completes the current 
cycle (until the last expected BRDY# is sampled asserted), 
asserts HLDA and floats its outputs as described on page 202. 

The system logic drives an inquire cycle within the hold 
acknowledge cycle. It asserts EADS#, which validates the 
inquire address on A[31:5]. If EADS# is sampled asserted
before HOLD is sampled negated, the processor recognizes it as 
a valid inquire cycle. 

In Figure 61, the processor asserts HIT# and negates HITM# on 
the clock edge after the clock edge on which EADS# is sampled
asserted, indicating the current inquire cycle hit a shared or 
exclusive cache line. (Shared and exclusive cache lines in the 
processor data or instruction cache have the same contents as 
the data in the external memory.) During an inquire cycle, the 
processor samples INV to determine whether the addressed 
cache line found in the processor's instruction or data cache 
transitions to the invalid state or the shared state. In this 
example, the processor samples INV asserted with EADS#, 
which invalidates the cache line. 

The system logic can negate HOLD off the same clock edge on 
which EADS# is sampled asserted. The processor continues 
driving HIT# in the same state until the next inquire cycle. 
HITM# is not asserted unless HIT# is asserted. 







[Timing diagram: Burst Memory Read followed by a HOLD-initiated inquire cycle.]

Figure 61. HOLD-Initiated Inquire Hit to Shared or Exclusive Line



HOLD-Initiated Inquire Hit to Modified Line

Figure 62 on page 206 shows the same sequence as Figure 61 on 
page 204, but in Figure 62 the inquire cycle hits a modified line 
and the processor asserts both HIT# and HITM#. In this 
example, the processor performs a writeback cycle immediately 
after the inquire cycle. It updates the modified cache line to the
external memory (normally, level-two cache or DRAM). The 
processor uses the address (A[31:5]) that was latched during the 
inquire cycle to perform the writeback cycle. The processor 
asserts HITM# throughout the writeback cycle and negates 
HITM# one clock edge after the last expected BRDY# of the 
writeback is sampled asserted. 

When the processor samples EADS# during the inquire cycle, it 
also samples INV to determine the cache line MESI state after 
the inquire cycle. If INV is sampled asserted during an inquire 
cycle, the processor transitions the line (if found) to the invalid 
state, regardless of its previous state. The cache line 
invalidation operation is not visible on the bus. If INV is 
sampled negated during an inquire cycle, the processor 
transitions the line (if found) to the shared state. In Figure 62 
the processor samples INV asserted during the inquire cycle. 
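The inquire responses described so far (hit or miss, HIT# and HITM#, writeback, and the resulting MESI state) can be summarized as a small function. This is an illustrative sketch: the name `inquire` and the single-letter state encoding are hypothetical, and the returned flags are logical "asserted" values, not pin voltages.

```python
def inquire(line_state: str, inv: bool):
    """Model the processor's response to one inquire cycle.

    line_state -- 'M', 'E', or 'S' for a line found in the cache,
                  'I' for a miss (line not present or invalid)
    inv        -- True when INV is sampled asserted with EADS#

    Returns (hit, hitm, writeback, new_state) where hit and hitm are True
    when HIT# and HITM# are asserted.
    """
    if line_state not in ("M", "E", "S", "I"):
        raise ValueError("line_state must be one of M, E, S, I")
    hit = line_state != "I"          # HIT# asserted on any valid line
    hitm = line_state == "M"         # HITM# asserted only for a modified line
    writeback = hitm                 # a modified line is written back first
    if not hit:
        new_state = "I"              # a miss leaves nothing to update
    else:
        new_state = "I" if inv else "S"
    return hit, hitm, writeback, new_state


if __name__ == "__main__":
    print(inquire("M", inv=True))    # the case shown in Figure 62
    print(inquire("E", inv=False))
```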

In a HOLD-initiated inquire cycle, the system logic can negate 
HOLD off the same clock edge on which EADS# is sampled 
asserted. The processor drives HIT# and HITM# on the clock 
edge after the clock edge on which EADS# is sampled asserted. 
AHOLD-Initiated Inquire Miss

AHOLD can be asserted by the system to initiate one or more 
inquire cycles. To allow the system to drive the address bus 
during an inquire cycle, the processor floats A[31:3] and AP off
the clock edge on which AHOLD is sampled asserted. The data 
bus and all other control and status signals remain under the 
control of the processor and are not floated. This functionality 
allows a bus cycle in progress when AHOLD is sampled asserted 
to continue to completion. The processor resumes driving the 
address bus off the clock edge on which AHOLD is sampled 
negated. 

In Figure 63 on page 208, the processor samples AHOLD 
asserted during the memory burst read cycle, and it floats the 
address bus off the same clock edge on which it samples AHOLD 
asserted. While the processor still controls the bus, it completes 
the current cycle until the last expected BRDY# is sampled 
asserted. The system logic drives EADS# with an inquire 
address on A[31:5] during an inquire cycle. The processor 
samples EADS# asserted and compares the inquire address to 
its tag address in both the instruction and data caches. In Figure 
63, the inquire address misses the tag address in the processor 
(both HIT# and HITM# are negated). Therefore, the processor
proceeds to the next cycle when it samples AHOLD negated. 
(The processor can drive a new cycle by asserting ADS# off the 
same clock edge that it samples AHOLD negated.) 

For an AHOLD-initiated inquire cycle to be recognized, the 
processor must sample AHOLD asserted for at least two 
consecutive clocks before it samples EADS# asserted. If the 
processor detects an address parity error during an inquire 
cycle, APCHK# is asserted for one clock. The system logic must 
respond appropriately to the assertion of this signal.
AHOLD-Initiated Inquire Hit to Shared or Exclusive Line

In Figure 64, the processor asserts HIT# and negates HITM# off 
the clock edge after the clock edge on which EADS# is sampled 
asserted, indicating the current inquire cycle hits either a 
shared or exclusive line. (HIT# is driven in the same state until
the next inquire cycle.) The processor samples INV asserted 
during the inquire cycle and transitions the line to the invalid 
state regardless of its previous state. 

During an AHOLD-initiated inquire cycle, the processor 
samples AHOLD on every clock edge until it is negated. In 
Figure 64, the processor asserts ADS# off the same clock on 
which AHOLD is sampled negated. If the inquire cycle hits a 
modified line, the processor performs a writeback cycle before 
it drives a new bus cycle. The next section describes the 
AHOLD-initiated inquire cycle that hits a modified line. 



[Timing diagram: Burst Memory Read with an AHOLD-initiated inquire cycle. Signals shown: CLK, A[31:3], BE[7:0]#, ADS#, M/IO#, D/C#, W/R#, HIT#, HITM#, D[63:0], KEN#, BRDY#, AHOLD, EADS#, INV.]

Figure 64. AHOLD-Initiated Inquire Hit to Shared or Exclusive Line
AHOLD-Initiated Inquire Hit to Modified Line

Figure 65 on page 211 shows an AHOLD-initiated inquire cycle 
that hits a modified line. During the inquire cycle in this 
example, the processor asserts both HIT# and HITM# on the 
clock edge after the clock edge that it samples EADS# asserted. 
This condition indicates that the cache line exists in the 
processor's data cache in the modified state. 

If the inquire cycle hits a modified line, the processor performs 
a writeback cycle immediately after the inquire cycle to update 
the modified cache line to shared memory (normally level-two 
cache or DRAM). In Figure 65, the system logic holds AHOLD 
asserted throughout the inquire cycle and the processor 
writeback cycle. In this case, the processor is not driving the 
address bus during the writeback cycle because AHOLD is 
sampled asserted. The system logic writes the data to memory 
by using its latched copy of the inquire cycle address. If the 
processor samples AHOLD negated before it performs the 
writeback cycle, it drives the writeback cycle by using the 
address (A[31:5]) that it latched during the inquire cycle. 

If INV is sampled asserted during an inquire cycle, the 
processor transitions the line (if found) to the invalid state, 
regardless of its previous state (the cache invalidation 
operation is not visible on the bus). If INV is sampled negated 
during an inquire cycle, the processor transitions the line (if 
found) to the shared state. In either case, if the line is found in 
the modified state, the processor writes it back to memory 
before changing its state. Figure 65 shows that the processor 
samples INV asserted during the inquire cycle and invalidates 
the cache line after the inquire cycle. 
Figure 65. AHOLD-Initiated Inquire Hit to Modified Line
AHOLD Restriction

When the system logic drives an AHOLD-initiated inquire 
cycle, it must assert AHOLD for at least two clocks before it 
asserts EADS#. This requirement guarantees the processor 
recognizes and responds to the inquire cycle properly. The 
processor's 32 address bus drivers turn on almost immediately
after AHOLD is sampled negated. If the processor switches the 
data bus (D[63:0] and DP[7:0]) during a write cycle off the same 
clock edge that switches the address bus (A[31:3] and AP), the 
processor switches 102 drivers simultaneously, which can lead 
to ground-bounce spikes. Therefore, before negating AHOLD 
the following restrictions must be observed by the system logic: 

■ When the system logic negates AHOLD during a write cycle, 
it must ensure that AHOLD is not sampled negated on the 
clock edge on which BRDY# is sampled asserted (See Figure 
66 on page 213). 

■ When the system logic negates AHOLD during a writeback 
cycle, it must ensure that AHOLD is not sampled negated on 
the clock edge on which ADS# is negated (See Figure 66). 

■ When a write cycle is pipelined into a read cycle, AHOLD 
must not be sampled negated on the clock edge after the 
clock edge on which the last BRDY# of the read cycle is 
sampled asserted to avoid the processor simultaneously 
driving the data bus (for the pending write cycle) and the 
address bus off this same clock edge. 
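The figure of 102 simultaneously switching drivers quoted above can be checked by counting the pins in the signal ranges named in the text. The helper `bus_width` is a hypothetical name used only for this arithmetic.

```python
def bus_width(high: int, low: int) -> int:
    """Number of pins in a bus written as NAME[high:low]."""
    return high - low + 1


# Data side: D[63:0] plus the parity pins DP[7:0].
data_drivers = bus_width(63, 0) + bus_width(7, 0)

# Address side: A[31:3] plus the single address-parity pin AP.
address_drivers = bus_width(31, 3) + 1

total_drivers = data_drivers + address_drivers

if __name__ == "__main__":
    print(data_drivers, address_drivers, total_drivers)
```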
[Timing diagram: AHOLD shown with two annotated cases, an illegal AHOLD negation during a write cycle and a legal AHOLD negation during a write cycle.]

The system must ensure that AHOLD is not sampled negated on the clock edge on which ADS# is negated.

The system must ensure that AHOLD is not sampled negated on the clock edge on which BRDY# is sampled asserted.

Figure 66. AHOLD Restriction



Bus Backoff (BOFF#) 



BOFF# provides the fastest response among bus-hold inputs. 
Either the system logic or another bus master can assert BOFF# 
to gain control of the bus immediately. BOFF# is also used to 
resolve potential deadlock problems that arise as a result of 
inquire cycles. The processor samples BOFF# on every clock 
edge. If BOFF# is sampled asserted, the processor 
unconditionally aborts any cycles in progress and transitions to 
a bus hold state. (See "BOFF# (Backoff)" on page 148.) Figure 
67 on page 214 shows a read cycle that is aborted when the 
processor samples BOFF# asserted even though BRDY# is 
sampled asserted on the same clock edge. The read cycle is 
restarted after BOFF# is sampled negated (KEN# must be in 
the same state during the restarted cycle as its state during the 
aborted cycle). 

During a BOFF#-initiated inquire cycle that hits a shared or 
exclusive line, the processor samples BOFF# negated and 
restarts any bus cycle that was aborted when BOFF# was 
asserted. If a BOFF#-initiated inquire cycle hits a modified line, 
the processor performs a writeback cycle before it restarts the 
aborted cycle. 

If the processor samples BOFF# asserted on the same clock 
edge that it asserts ADS#, ADS# is floated but the system logic 
may erroneously interpret ADS# as asserted. In this case, the 
system logic must properly interpret the state of ADS# when 
BOFF# is negated. 




Figure 67. BOFF# Timing 
Locked Cycles

The processor asserts LOCK# during a sequence of bus cycles to 
ensure the cycles are completed without allowing other bus 
masters to intervene. Locked operations can consist of two to 
five cycles. LOCK# is asserted during the following operations: 

■ An interrupt acknowledge sequence 

■ Descriptor Table accesses 

■ Page Directory and Page Table accesses 

■ XCHG instruction 

■ An instruction with an allowable LOCK prefix 

In order to ensure that locked operations appear on the bus and 
are visible to the entire system, any data operands addressed 
during a locked cycle that reside in the processor's cache are 
flushed and invalidated from the cache prior to the locked 
operation. If the cache line is in the modified state, it is written 
back and invalidated prior to the locked operation. Likewise, 
any data read during a locked operation is not cached. The 
processor negates LOCK# for at least one clock between 
consecutive sequences of locked operations to allow the system 
logic to arbitrate for the bus. 

The processor asserts SCYC during misaligned locked transfers 
on the D[63:0] data bus. The processor generates additional bus 
cycles to complete the transfer of misaligned data. 

Basic Locked Operation 

Figure 68 on page 216 shows a pair of read-write bus cycles. It 
represents a typical read-modify-write locked operation. The 
processor asserts LOCK# off the same clock edge that it asserts 
ADS# of the first bus cycle in the locked operation and holds it 
asserted until the last expected BRDY# of the last bus cycle in 
the locked operation is sampled asserted. (The processor 
negates LOCK# off the same clock edge.) 
[Timing diagram not reproduced. Labels: Locked Read Cycle, Locked Write Cycle; bus states ADDR, DATA, IDLE.]

Figure 68. Basic Locked Operation

Locked Operation with BOFF# Intervention



Figure 69 on page 217 shows BOFF# asserted within a locked
read-write pair of bus cycles. In this example, the processor
asserts LOCK# with ADS# to drive a locked memory read cycle
followed by a locked memory write cycle. During the locked
memory write cycle in this example, the processor samples
BOFF# asserted. The processor immediately aborts the locked
memory write cycle and floats all its bus-driving signals,
including LOCK#. The system logic or another bus master can
initiate an inquire cycle or drive a new bus cycle one clock edge
after the clock edge on which BOFF# is sampled asserted. If the
system logic drives a BOFF#-initiated inquire cycle and hits a
modified line, the processor performs a writeback cycle before
it restarts the locked cycle (the processor asserts LOCK# during
the writeback cycle).

In Figure 69 on page 217, the processor immediately restarts
the aborted locked write cycle by driving the bus off the clock
edge on which BOFF# is sampled negated. The system logic
must ensure the processor results for interrupted and
uninterrupted locked cycles are consistent. That is, the system
logic must guarantee the memory accessed by the processor is
not modified during the time another bus master controls the
bus.



[Timing diagram not reproduced. Label: Restart Write Cycle.]

Figure 69. Locked Operation with BOFF# Intervention
Interrupt Acknowledge

In response to recognizing the system's maskable interrupt
(INTR), the processor drives an interrupt acknowledge cycle at
the next instruction boundary. During an interrupt
acknowledge cycle, the processor drives a locked pair of read
cycles as shown in Figure 70 on page 219. The first read cycle is
not functional, and the second read cycle returns the interrupt
number on D[7:0] (00h-FFh). Table 48 shows the state of the
signals during an interrupt acknowledge cycle.



Table 48. Interrupt Acknowledge Operation Definition

Processor Outputs  First Bus Cycle  Second Bus Cycle
-----------------  ---------------  ----------------------------------------
D/C#               Low              Low
M/IO#              Low              Low
W/R#               Low              Low
BE[7:0]#           EFh              FEh (low byte enabled)
A[31:3]            0000_0000h       0000_0000h
D[63:0]            (ignored)        Interrupt number expected from interrupt
                                    controller on D[7:0]



The system logic can drive INTR either synchronously or 
asynchronously. If it is asserted asynchronously, it must be 
asserted for a minimum pulse width of two clocks. To ensure it 
is recognized, INTR must remain asserted until an interrupt 
acknowledge sequence is complete. 
[Timing diagram not reproduced. Signals shown: CLK, A[31:3], BE[7:0]#, ADS#, M/IO#, D/C#, W/R#, LOCK#, INTR, D[63:0], KEN#, BRDY#. Label: Interrupt Acknowledge Cycles.]

Figure 70. Interrupt Acknowledge Operation
Special Bus Cycles



The AMD-K6 3D processor drives special bus cycles that 
include stop grant, flush acknowledge, cache writeback 
invalidation, halt, cache invalidation, and shutdown cycles. 
During all special cycles, D/C# = 0, M/IO# = 0, and W/R# = 1. 
BE[7:0]# and A[31:3] are driven to differentiate among the 
special cycles, as shown in Table 49. The system logic must 
return BRDY# in response to all processor special cycles. 



Table 49. Encodings For Special Bus Cycles

BE[7:0]#  A[4:3]*  Special Bus Cycle  Cause
--------  -------  -----------------  ------------------------
FBh       10b      Stop Grant         STPCLK# sampled asserted
EFh       00b      Flush Acknowledge  FLUSH# sampled asserted
F7h       00b      Writeback          WBINVD instruction
FBh       00b      Halt               HLT instruction
FDh       00b      Flush              INVD, WBINVD instruction
FEh       00b      Shutdown           Triple fault

Note:

* A[31:5] = 0
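
The encodings in Table 49 can be read as a small lookup, which may help when checking system logic that must tell the special cycles apart. The following is a minimal sketch only: the function name and the choice to pass the address as a 32-bit value (with bits 31:3 corresponding to A[31:3]) are assumptions made for illustration, not part of the processor specification.

```python
# Encodings transcribed from Table 49: (BE[7:0]#, A[4:3]) -> special bus cycle.
SPECIAL_CYCLES = {
    (0xFB, 0b10): "Stop Grant",
    (0xEF, 0b00): "Flush Acknowledge",
    (0xF7, 0b00): "Writeback",
    (0xFB, 0b00): "Halt",
    (0xFD, 0b00): "Flush",
    (0xFE, 0b00): "Shutdown",
}


def decode_special_cycle(dc, mio, wr, be, addr):
    """Return the special-cycle name, or None if this is not a special cycle.

    dc, mio, wr are the D/C#, M/IO#, and W/R# levels (0 or 1), be is the value
    on BE[7:0]#, and addr is a hypothetical 32-bit view of the address bus.
    """
    # All special cycles drive D/C# = 0, M/IO# = 0, and W/R# = 1.
    if (dc, mio, wr) != (0, 0, 1):
        return None
    # Table 49 note: A[31:5] must be 0.
    if addr >> 5:
        return None
    a4_3 = (addr >> 3) & 0b11
    return SPECIAL_CYCLES.get((be, a4_3))
```

Note that Stop Grant and Halt share BE[7:0]# = FBh and differ only in A[4:3], so a decoder that ignores the address bits cannot distinguish them. With W/R# = 0, the same BE[7:0]# = EFh pattern belongs to the first interrupt acknowledge cycle of Table 48 rather than to a special cycle, which is why the sketch returns None in that case.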



Basic Special Bus Cycle 

Figure 71 on page 221 shows a basic special bus cycle. The 
processor drives D/C# = 0, M/IO# = 0, and W/R# = 1 off the same 
clock edge that it asserts ADS#. In this example, BE[7:0]# = FBh 
and A[31:3] = 0000_0000h, which indicates that the special 
cycle is a halt special cycle (See Table 49). A halt special cycle is 
generated after the processor executes the HLT instruction. 

If the processor samples FLUSH# asserted, it writes back any 
data cache lines that are in the modified state and invalidates 
all lines in the instruction and data cache. The processor then 
drives a flush acknowledge special cycle. 

If the processor executes a WBINVD instruction, it drives a 
writeback special cycle after the processor completes 
invalidating and writing back the cache lines. 






Shutdown Cycle 



In Figure 72, a shutdown (triple fault) occurs in the first half of
the waveform, and a shutdown special cycle follows in the
second half. The processor enters shutdown when an interrupt
or exception occurs during the handling of a double fault (INT
8), which amounts to a triple fault. When the processor
encounters a triple fault, it stops its activity on the bus and
generates the shutdown special bus cycle (BE[7:0]# = FEh).

The system logic must assert NMI, INIT, RESET, or SMI# to get 
the processor out of the shutdown state. 



[Timing diagram not reproduced. Label: Shutdown Special Cycle.]

Figure 72. Shutdown Cycle
Stop Grant and Stop Clock States

Figure 73 on page 224 and Figure 74 on page 225 show the 
processor transition from normal execution to the Stop Grant 
state, then to the Stop Clock state, back to the Stop Grant state, 
and finally back to normal execution. The series of transitions 
begins when the processor samples STPCLK# asserted. On 
recognizing a STPCLK# interrupt at the next instruction 
retirement boundary, the processor performs the following 
actions, in the order shown: 

1. Its instruction pipelines are flushed 

2. All pending and in-progress bus cycles are completed 

3. The STPCLK# assertion is acknowledged by executing a 
Stop Grant special bus cycle 

4. Its internal clock is stopped after BRDY# of the Stop Grant
special bus cycle is sampled asserted and after EWBE# is
sampled asserted

5. The Stop Clock state is entered if the system logic stops the 
bus clock CLK (optional) 

STPCLK# is sampled as a level-sensitive input on every clock
edge but is not recognized until the next instruction boundary.
The system logic drives the signal either synchronously or 
asynchronously. If it is asserted asynchronously, it must be 
asserted for a minimum pulse width of two clocks. STPCLK# 
must remain asserted until recognized, which is indicated by 
the completion of the Stop Grant special cycle. 
Figure 73. Stop Grant and Stop Clock Modes, Part 1

Figure 74. Stop Grant and Stop Clock Modes, Part 2

INIT-Initiated Transition from Protected Mode to Real Mode

INIT is typically asserted in response to a BIOS interrupt that 
writes to an I/O port. This interrupt is often in response to a 
Ctrl-Alt-Del keyboard input. The BIOS writes to a port (similar 
to port 64h in the keyboard controller) that asserts INIT. INIT is 
also used to support 80286 software that must return to Real 
mode after accessing extended memory in Protected mode. 

The assertion of INIT causes the processor to empty its
pipelines, initialize most of its internal state, and branch to
address FFFF_FFF0h, the same instruction execution starting
point used after RESET. Unlike RESET, the processor
preserves the contents of its caches, the floating-point state, the
MMX state, model-specific registers (MSRs), the CD and NW
bits of the CR0 register, the time stamp counter, and other
specific internal resources.
Figure 75 shows an example in which the operating system writes
to an I/O port, causing the system logic to assert INIT. The
sampling of INIT asserted starts an extended microcode
sequence that terminates with a code fetch from FFFF_FFF0h,
the reset location. INIT is sampled on every clock edge but is not
recognized until the next instruction boundary. During an I/O 
write cycle, it must be sampled asserted a minimum of three 
clock edges before BRDY# is sampled asserted if it is to be 
recognized on the boundary between the I/O write instruction 
and the following instruction. If INIT is asserted synchronously, 
it can be asserted for a minimum of one clock. If it is asserted 
asynchronously, it must have been negated for a minimum of two 
clocks, followed by an assertion of a minimum of two clocks. 




Figure 75. INIT-Initiated Transition from Protected Mode to Real Mode
7

Power-on Configuration and 
Initialization 

On power-on the system logic must reset the AMD-K6 3D 
processor by asserting the RESET signal. When the processor 
samples RESET asserted, it immediately flushes and initializes 
all internal resources and its internal state, including its 
pipelines and caches, the floating-point state, the MMX and 3D 
states, and all registers. Then the processor jumps to address
FFFF_FFF0h to start instruction execution.

Signals Sampled During the Falling Transition of RESET 



FLUSH#

FLUSH# is sampled on the falling transition of RESET to
determine if the processor begins normal instruction execution
or enters Tri-State Test mode. If FLUSH# is High during the
falling transition of RESET, the processor unconditionally runs
its Built-in Self Test (BIST), performs the normal reset
functions, then jumps to address FFFF_FFF0h to start
instruction execution. (See "Built-in Self-Test (BIST)" on page
270 for more details.) If FLUSH# is Low during the falling
transition of RESET, the processor enters Tri-State Test mode.
(See "Tri-State Test Mode" on page 270 and "FLUSH# (Cache
Flush)" on page 159 for more details.)



BF[2:0] 

The internal operating frequency of the processor is 
determined by the state of the bus frequency signals BF[2:0] 
when they are sampled during the falling transition of RESET. 
The frequency of the CLK input signal is multiplied internally 
by a ratio defined by BF[2:0]. (See "BF[2:0] (Bus Frequency)" on
page 147 for the processor-clock to bus-clock ratios.)

BRDYC# 

BRDYC# is sampled on the falling transition of RESET to 
configure the drive strength of A[20:3], ADS#, HITM#, and 
W/R#. If BRDYC# is Low during the fall of RESET, these 
outputs are configured using higher drive strengths than the 
standard strength. If BRDYC# is High during the fall of RESET, 
the standard strength is selected. (See "BRDYC# (Burst Ready 
Copy)" on page 150 for more details.) 
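
The effect of the FLUSH# and BRDYC# levels captured on the falling transition of RESET can be summarized in a short sketch. The function and dictionary keys below are hypothetical names chosen for illustration. BF[2:0] is deliberately left out because the ratio table is given elsewhere (page 147) and is not reproduced in this section.

```python
def reset_configuration(flush_n, brdyc_n):
    """Interpret pin levels (1 = High, 0 = Low) sampled as RESET falls.

    flush_n is the level on FLUSH#, brdyc_n is the level on BRDYC#.
    """
    if flush_n:
        # FLUSH# High: BIST runs, then normal reset and a jump to FFFF_FFF0h.
        mode = "BIST then normal execution"
    else:
        # FLUSH# Low: the processor enters Tri-State Test mode.
        mode = "Tri-State Test mode"

    # BRDYC# selects the drive strength of A[20:3], ADS#, HITM#, and W/R#.
    drive = "standard" if brdyc_n else "higher than standard"
    return {"mode": mode, "drive_strength": drive}
```

For example, a board that ties both pins High at reset would get `{"mode": "BIST then normal execution", "drive_strength": "standard"}` from this model.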



RESET Requirements 



During the initial power-on reset of the processor, RESET must
remain asserted for a minimum of 1.0 ms after CLK and Vcc
reach specification. (See "CLK Switching Characteristics" on
page 314 for clock specifications. See Chapter 14, "Electrical
Data" on page 303 for Vcc specifications.)

During a warm reset while CLK and Vcc are within
specification, RESET must remain asserted for a minimum of
15 clocks prior to its negation.
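
As a worked example of the two RESET requirements, the sketch below converts the 1.0-ms power-on minimum into bus clocks for a given CLK frequency and returns the fixed 15-clock minimum for a warm reset. The function name and the 100-MHz figure in the usage note are illustrative assumptions. Integer arithmetic is used so that the result is rounded up without floating-point error.

```python
COLD_RESET_MIN_US = 1000     # 1.0 ms after CLK and Vcc reach specification
WARM_RESET_MIN_CLOCKS = 15   # minimum while CLK and Vcc are within specification


def min_reset_clocks(clk_hz, cold):
    """Minimum number of CLK periods for which RESET must remain asserted."""
    if cold:
        # Ceiling of (clk_hz * 1.0 ms), computed with integers only.
        return (clk_hz * COLD_RESET_MIN_US + 999_999) // 1_000_000
    return WARM_RESET_MIN_CLOCKS
```

With a 100-MHz bus clock, `min_reset_clocks(100_000_000, cold=True)` evaluates to 100000 clocks, compared with 15 clocks for a warm reset.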
State of Processor After RESET




Output Signals 

Table 50 shows the state of all processor outputs and 
bidirectional signals immediately after RESET is sampled 
asserted. 

Table 50. Output Signal State After RESET

Signal            State     Signal    State
----------------  --------  --------  --------
A[31:3], AP       Floating  LOCK#     High
ADS#, ADSC#       High      M/IO#     Low
APCHK#            High      PCD       Low
BE[7:0]#          Floating  PCHK#     High
BREQ              Low       PWT       Low
CACHE#            High      SCYC      Low
D/C#              Low       SMIACT#   High
D[63:0], DP[7:0]  Floating  TDO       Floating
FERR#             High      VCC2DET   Low
HIT#              High      VCC2H/L#  Low
HITM#             High      W/R#      Low
HLDA              Low







Registers 

Table 51 on page 230 shows the state of all architecture 
registers and model-specific registers (MSRs) after the 
processor has completed its initialization due to the recognition 
of the assertion of RESET. 
Table 51. Register State After RESET

Register                 State (hex)                   Notes
-----------------------  ----------------------------  -----
GDTR                     base:0000_0000h limit:0FFFFh
IDTR                     base:0000_0000h limit:0FFFFh
TR                       0000h
LDTR                     0000h
EIP                      FFFF_FFF0h
EFLAGS                   0000_0002h
EAX                      0000_0000h                    1
EBX                      0000_0000h
ECX                      0000_0000h
EDX                      0000_058Xh                    2
ESI                      0000_0000h
EDI                      0000_0000h
EBP                      0000_0000h
ESP                      0000_0000h
CS                       F000h
SS                       0000h
DS                       0000h
ES                       0000h
FS                       0000h
GS                       0000h
FPU Stack R7-R0          0000_0000_0000_0000_0000h     3
FPU Control Word         0040h                         3
FPU Status Word          0000h                         3
FPU Tag Word             5555h                         3
FPU Instruction Pointer  0000_0000_0000h               3
FPU Data Pointer         0000_0000_0000h               3
FPU Opcode Register      000_0000_0000b                3
CR0                      6000_0010h                    4
CR2                      0000_0000h
CR3                      0000_0000h
CR4                      0000_0000h
DR7                      0000_0400h
DR6                      FFFF_0FF0h
DR3                      0000_0000h
DR2                      0000_0000h
DR1                      0000_0000h
DR0                      0000_0000h
MCAR                     0000_0000_0000_0000h          3
MCTR                     0000_0000_0000_0000h          3
TR12                     0000_0000_0000_0000h          3
TSC                      0000_0000_0000_0000h          3
EFER                     0000_0000_0000_0000h          3
STAR                     0000_0000_0000_0000h          3
WHCR                     0000_0000_0000_0000h          3

Notes:

1. The contents of EAX indicate if BIST was successful. If EAX = 0000_0000h, BIST was successful.
   If EAX is non-zero, BIST failed.
2. EDX contains the AMD-K6 processor signature, where X indicates the processor Stepping ID.
3. The contents of these registers are preserved following the recognition of INIT.
4. The CD and NW bits of CR0 are preserved following the recognition of INIT.
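
Startup code typically inspects a few of the reset values in Table 51. The sketch below decodes three of them under stated assumptions: the CD (bit 30) and NW (bit 29) positions come from this document, while the PG (bit 31) and PE (bit 0) positions and the family/model/stepping layout of the signature follow the usual x86 convention and are not spelled out in this section. All function names are hypothetical.

```python
def decode_cr0(cr0):
    """Extract the cache-control and mode bits of CR0."""
    return {
        "PG": (cr0 >> 31) & 1,  # paging enable (conventional x86 position)
        "CD": (cr0 >> 30) & 1,  # cache disable
        "NW": (cr0 >> 29) & 1,  # not writethrough
        "PE": cr0 & 1,          # protection enable (conventional x86 position)
    }


def decode_signature(edx):
    """Split the processor signature left in EDX after RESET."""
    return {
        "family": (edx >> 8) & 0xF,
        "model": (edx >> 4) & 0xF,
        "stepping": edx & 0xF,
    }


def bist_passed(eax):
    """Note 1 of Table 51: EAX is zero after a successful BIST."""
    return eax == 0
```

Applied to the reset value 6000_0010h, `decode_cr0` reports CD = 1 and NW = 1 with PG = 0 and PE = 0, which means the processor leaves RESET in Real mode with its caches disabled until software clears those bits.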
State of Processor After INIT



The recognition of the assertion of INIT causes the processor to
empty its pipelines, to initialize most of its internal state, and to
branch to address FFFF_FFF0h, the same instruction
execution starting point used after RESET. Unlike RESET, the
processor preserves the contents of its caches, the
floating-point state, the MMX and 3D states, MSRs, and the CD
and NW bits of the CR0 register.

The edge-sensitive interrupts FLUSH# and SMI# are sampled 
and preserved during the INIT process and are handled 
accordingly after the initialization is complete. However, the 
processor resets any pending NMI interrupt upon sampling 
INIT asserted. 

INIT can be used as an accelerator for 80286 code that requires 
a reset to exit from Protected mode back to Real mode. 
8

Cache Organization 



The following sections describe the basic architecture and 
resources of the AMD-K6 3D processor internal caches. 

The performance of the processor is enhanced by a writeback
level-one (L1) cache. The cache is organized as a separate
32-Kbyte instruction cache and a 32-Kbyte data cache, each 
with two-way set associativity (See Figure 76 on page 234). The 
cache line size is 32 bytes, and lines are prefetched from main 
memory using an efficient, pipelined burst transaction. As the 
instruction cache is filled, each instruction byte is analyzed for 
instruction boundaries using predecode logic. Predecoding 
annotates each instruction byte with information that later 
enables the decoders to efficiently decode multiple instructions 
simultaneously. Translation lookaside buffers (TLB) are also 
used to translate linear addresses to physical addresses. The 
instruction cache is associated with a 64-entry TLB while the 
data cache is associated with a 128-entry TLB. 
[Block diagram not reproduced. Blocks shown: System Bus Interface Unit; 32-Kbyte Instruction Cache (Way 0 and Way 1, each with Tag RAM and State Bit); 64-Entry TLB; Pre-Decode; 128-Entry TLB; 32-Kbyte Data Cache (Way 0 and Way 1, each with Tag RAM and MESI Bits).]

Figure 76. Cache Organization

The processor cache design takes advantage of a sectored 
organization (See Figure 77). Each sector consists of 64 bytes 
configured as two 32-byte cache lines. The two cache lines of a 
sector share a common tag but have separate MESI (modified, 
exclusive, shared, invalid) bits that track the state of each cache 
line. 
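
The geometry described above (32 Kbytes, two ways, 64-byte sectors of two 32-byte lines) implies 256 sectors per way. The sketch below derives that number and splits an address into tag, sector index, line-within-sector, and byte offset. The exact address bits the processor uses for indexing are not stated in this section, so the conventional low-order split shown here is an assumption, and the function name is hypothetical.

```python
CACHE_BYTES = 32 * 1024
WAYS = 2
LINE_BYTES = 32
LINES_PER_SECTOR = 2
SECTOR_BYTES = LINE_BYTES * LINES_PER_SECTOR         # 64 bytes
SECTORS_PER_WAY = CACHE_BYTES // (WAYS * SECTOR_BYTES)  # 256


def split_address(addr):
    """Return (tag, sector_index, line, offset) for an address.

    Assumes conventional indexing: offset in the lowest bits, then the line
    select, then the sector index, with the remaining high bits as the tag.
    """
    offset = addr % LINE_BYTES
    line = (addr // LINE_BYTES) % LINES_PER_SECTOR
    sector_index = (addr // SECTOR_BYTES) % SECTORS_PER_WAY
    tag = addr // (SECTOR_BYTES * SECTORS_PER_WAY)
    return tag, sector_index, line, offset
```

Under these assumptions, addresses that differ only in the line-select bit map to the same sector and share one tag, which is what allows the two lines of a sector to be tracked with separate state bits.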



Instruction Cache Line

  Tag Address
  Cache Line 1: Byte 31 through Byte 0, each with Predecode Bits; 1 MESI Bit
  Cache Line 2: Byte 31 through Byte 0, each with Predecode Bits; 1 MESI Bit

Data Cache Line

  Tag Address
  Cache Line 1: Byte 31 through Byte 0; 2 MESI Bits
  Cache Line 2: Byte 31 through Byte 0; 2 MESI Bits

Note: Instruction-cache lines have only two coherency states (valid or invalid) rather than
the four MESI coherency states of data-cache lines. Only two states are needed for the
instruction cache because these lines are read-only.

Figure 77. Cache Sector Organization



MESI States in the Data Cache 



The state of each line in the caches is tracked by the MESI bits.
The coherency of these states or MESI bits is maintained by
internal processor snoops and external inquiries by the system
logic. The following four states are defined for the data cache:

■ Modified — This line has been modified and is different from 
main memory. 

■ Exclusive — This line is not modified and is the same as main 
memory. If this line is written to, it becomes Modified. 

■ Shared — If a cache line is in the shared state it means that 
the same line can exist in more than one cache system. 

■ Invalid — The information in this line is not valid. 



Predecode Bits 



Decoding x86 instructions is particularly difficult because the 
instructions vary in length, ranging from 1 to 15 bytes long. 
Predecode logic supplies the predecode bits associated with 
each instruction byte. The predecode bits indicate the number 
of bytes to the start of the next x86 instruction. The predecode 
bits are passed with the instruction bytes to the decoders where 
they assist with parallel x86 instruction decoding. The 
predecode bits use memory separate from the 32-Kbyte 
instruction cache. The predecode bits are stored in an extended 
instruction cache alongside each x86 instruction byte as shown 
in Figure 77 on page 234. 
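
To illustrate what "the number of bytes to the start of the next x86 instruction" means for each byte, the sketch below annotates a byte stream whose instruction lengths are already known. This is only a model of the information carried by the predecode bits: the actual bit encoding is not given in this section, and the real predecode logic must determine the lengths from the instruction bytes themselves. The function name is hypothetical.

```python
def predecode_distances(instruction_lengths):
    """For each byte, give the distance in bytes to the next instruction start.

    instruction_lengths lists the length of each consecutive instruction.
    """
    distances = []
    for length in instruction_lengths:
        if not 1 <= length <= 15:
            raise ValueError("x86 instructions are 1 to 15 bytes long")
        # The first byte is `length` bytes from the next start, the last is 1.
        distances.extend(range(length, 0, -1))
    return distances
```

For a 1-byte, a 3-byte, and a 2-byte instruction in sequence, the per-byte annotations are 1, then 3, 2, 1, then 2, 1. A decoder holding these values can locate several instruction boundaries at once instead of decoding lengths serially.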



Cache Operation



The operating modes for the caches are configured by software
using the not writethrough (NW) and cache disable (CD) bits of
control register 0 (CR0 bits 29 and 30, respectively). These bits
are used in all operating modes.

When the CD and NW bits are both set to 0, the cache is fully
enabled. This is the standard operating mode for the cache. If a
read miss occurs when the processor reads from the cache, a
line fill takes place. Write hits update the cache, while
write misses and writes to shared lines cause external memory
updates.

Note: A write allocate operation can modify the behavior of write
misses to the cache. See "Write Allocate" on page 240.

When CD is set to 0 and NW is set to 1, an invalid mode of 
operation exists that causes a general protection fault to occur. 

When CD is set to 1 (disabled) and NW is set to 0, the cache fill 
mechanism is disabled but the contents of the cache are still 
valid. The processor reads from the cache and, if a read miss 
occurs, no line fills take place. Write hits to the cache are 
updated, while write misses and writes to shared lines cause 
external memory updates. 

When the CD and NW bits are both set to 1, the cache is fully 
disabled. Even though the cache is disabled, the contents are 
not necessarily invalid. The processor reads from the cache and, 
if a read miss occurs, no line fills take place. If a write hit 
occurs, the cache is updated but an external memory update 
does not occur. If a data line is in the exclusive state during a 
write hit, the MESI bits are changed to the modified state. 
Write misses access memory directly. 
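
The four CD/NW combinations described above can be condensed into a small function. This is a sketch of the behavior as stated in the text, not of the hardware: the function name and the two dictionary keys are hypothetical, and write-allocate effects are ignored.

```python
def cache_mode(cd, nw):
    """Summarize cache behavior for the CD and NW bits of CR0."""
    if (cd, nw) == (0, 1):
        # This combination is invalid and causes a general protection fault.
        raise ValueError("CD=0, NW=1 causes a general protection fault")
    return {
        # Line fills on read misses happen only while CD = 0.
        "line_fill_on_read_miss": cd == 0,
        # Writes to shared lines reach external memory unless CD = NW = 1,
        # in which case a write hit updates only the cache.
        "shared_write_updates_memory": not (cd == 1 and nw == 1),
    }
```

Note that in both CD = 1 modes the cache contents remain valid and can still be hit, which is why fully disabling cache accesses also requires a flush.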

The operating system can control the cacheability of a page. 
The paging mechanism is controlled by CR3, the Page Directory 
Entry (PDE), and the Page Table Entry (PTE). Within CR3, 
PDE, and PTE are Page Cache Disable (PCD) and Page 
Writethrough (PWT) bits. The values of the PCD and PWT bits 
used in Table 52 through Table 54 on page 237 are taken from 
either the PTE or PDE. For more information see the 
descriptions of PCD and PWT on pages 171 and 173, 
respectively. 

Table 52 through Table 54 describe the logic that determines
the cacheability of a cycle and how that cacheability is affected
by the PCD bits, the PWT bits, the PG bit of CR0, the CD bit of
CR0, writeback cycles, the Cache Inhibit (CI) bit of Test
Register 12 (TR12), and unlocked memory reads.
Table 52 describes how the PWT signal is driven based on the
values of the PWT bits and the PG bit of CR0.

Table 52. PWT Signal Generation

PWT Bit*  PG Bit of CR0  PWT Signal
--------  -------------  ----------
1         1              High
0         1              Low
1         0              Low
0         0              Low

Note:

* PWT is taken from PTE or PDE



Table 53 describes how the PCD signal is driven based on the
values of the CD bit of CR0, the PCD bits, and the PG bit of
CR0.

Table 53. PCD Signal Generation

CD Bit of CR0  PCD Bit*  PG Bit of CR0  PCD Signal
-------------  --------  -------------  ----------
1              X         X              High
0              1         1              High
0              0         1              Low
0              1         0              Low
0              0         0              Low

Note:

* PCD is taken from PTE or PDE



Table 54 describes how the CACHE# signal is driven based on
writeback cycles, the CI bit of TR12, unlocked memory reads,
and the PCD signal.

Table 54. CACHE# Signal Generation

Writeback  CI Bit of  Unlocked
Cycle      TR12       Memory Reads  PCD Signal  CACHE#
---------  ---------  ------------  ----------  ------
1          X          X             X           Low
0          1          1             High        High
0          0          1             High        High
0          1          0             High        High
0          0          0             High        High
0          1          1             Low         High
0          0          1             Low         Low
0          1          0             Low         High
0          0          0             Low         High
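
Tables 52 through 54 each reduce to a short Boolean expression. The sketch below expresses them as functions that return "High" or "Low"; the function names are hypothetical. It assumes that the one non-writeback case in which CACHE# is driven Low is an unlocked memory read with CI = 0 and the PCD signal Low, as the row ordering of Table 54 suggests.

```python
def pwt_signal(pwt_bit, pg):
    """Table 52: PWT is High only when the PWT bit and paging are both set."""
    return "High" if (pwt_bit and pg) else "Low"


def pcd_signal(cd, pcd_bit, pg):
    """Table 53: PCD is High if CD is set, or if the PCD bit is set with paging on."""
    return "High" if (cd or (pcd_bit and pg)) else "Low"


def cache_n_signal(writeback, ci, unlocked_read, pcd):
    """Table 54: CACHE# is active Low; pcd is the PCD signal level as a string."""
    if writeback:
        return "Low"
    cacheable_read = (not ci) and unlocked_read and pcd == "Low"
    return "Low" if cacheable_read else "High"
```

The functions chain naturally: for a non-writeback, unlocked read with CI = 0, `cache_n_signal(0, 0, 1, pcd_signal(cd, pcd_bit, pg))` is Low only when CD is clear and the page is not marked cache-disabled under paging.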



Cache-Related Signals 



Complete descriptions of the signals that control cacheability 
and cache coherency are given on the following pages: 

■ CACHE#: page 152

■ EADS#: page 156

■ FLUSH#: page 159

■ HIT#: page 160

■ HITM#: page 160

■ INV: page 165

■ KEN#: page 166

■ PCD: page 171

■ PWT: page 173

■ WB/WT#: page 183



Cache Disabling 



To completely disable all cache accesses, the CD and NW bits 
must be set to 1 and the cache must be completely flushed. 

There are two different methods for flushing the cache. The 
first method relies on the system logic and the second relies on 
software. 



For the system logic to flush the cache, the processor must 
sample FLUSH# asserted. In this method, the processor writes 
back any data cache lines that are in the modified state, 
invalidates all lines in the instruction and data caches, and then 
executes a flush acknowledge special cycle (See Table 44 on 
page 186). 
Software can use two different instructions to flush the cache.
Both the WBINVD and INVD instructions cause all cache lines
to be marked invalid. The WBINVD instruction causes all
modified lines to first be written back to memory. The INVD
instruction invalidates all cache lines without writing modified
lines back to memory.

Any area of system memory can be cached. However, the 
processor prevents caching of locked operations and TLB reads, 
the operating system can prevent caching of certain pages by 
setting the PCD and PWT bits in the PDE or PTE, and system 
logic can prevent caching of certain bus cycles by negating the 
KEN# input signal with the first BRDY# or NA# of a cycle. 

Cache-Line Fills



When the processor needs to read memory, the processor drives
a read cycle onto the bus. If the cycle is cacheable, the processor
asserts CACHE#. The system logic also has control of the
cacheability of bus cycles. If it determines the address is
cacheable, system logic asserts the KEN# signal and the
appropriate value of WB/WT#.

One of two events takes place next. If the cycle is not cacheable,
a non-pipelined, single-transfer read takes place. The processor
waits for the system logic to return the data and assert a single
BRDY# (See Figure 54 on page 193). If the cycle is cacheable,
the processor executes a 32-byte burst read cycle. The processor
expects a total of four BRDY# signals for a burst read cycle to
take place (See Figure 56 on page 197).

Instruction-cache line fills initiate 32-byte transfers from
memory (one burst cycle) on the bus. Data-cache line fills also
initiate 32-byte transfers on the bus. If the data-cache line being
filled replaces a modified line, the prior contents of the line are
copied to a 32-byte writeback (copyback) buffer in the bus
interface unit while the new line is being read.
Cache-Line Replacements



As programs execute and task switches occur, some cache lines 
eventually require replacement. 

Instruction cache lines are replaced using a Least Recently
Used (LRU) algorithm. If line replacement is required, lines are
replaced when read cache misses occur.

The data cache uses a slightly different approach to line 
replacement. If a miss occurs, and a replacement is required, 
lines are replaced by using a Least Recently Allocated (LRA) 
algorithm. 

Two forms of cache misses and associated cache fills can take 
place — a sector replacement and a cache line replacement. In 
the case of a sector replacement, the miss is due to a tag 
mismatch, in which case the required cache line is filled from 
external memory, and the cache line within the sector that was 
not required is marked as invalid. In the case of a cache line 
replacement, the address matches the tag, but the requested 
cache line is marked as invalid. The required cache line is filled 
from external memory, and the cache line within the sector that 
is not required remains in the same cache state. 
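
The difference between the two kinds of fill can be sketched for a single sector. The model below is a simplification made for illustration: each line carries only a valid flag instead of full MESI state, way selection is ignored, and the function name is hypothetical.

```python
def access_sector(stored_tag, line_valid, tag, line):
    """Classify a lookup in one sector and return (kind, new_tag, new_valid).

    stored_tag: tag currently held by the sector, or None if it is empty.
    line_valid: two booleans, one per 32-byte cache line of the sector.
    tag, line:  tag and line number (0 or 1) of the requested address.
    """
    if stored_tag != tag:
        # Tag mismatch: fill the required line and invalidate the other one.
        new_valid = [False, False]
        new_valid[line] = True
        return "sector replacement", tag, new_valid
    if not line_valid[line]:
        # Tag match but invalid line: fill it and leave the other line as is.
        new_valid = list(line_valid)
        new_valid[line] = True
        return "cache line replacement", tag, new_valid
    return "hit", tag, list(line_valid)
```

The key asymmetry is visible in the returned flags: a sector replacement discards the neighboring line because it belonged to the old tag, while a cache line replacement preserves it.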



Write Allocate



Write allocate, if enabled, occurs when the processor has a
pending memory write cycle to a cacheable line and the line
does not currently reside in the L1 data cache. In this case, the
processor performs a burst read cycle to fetch the data-cache
line addressed by the pending write cycle. The data associated
with the pending write cycle is merged with the
recently-allocated data-cache line and stored in the processor's
L1 data cache. The final MESI state of the cache line depends
on the state of the WB/WT# and PWT signals during the burst
read cycle and the subsequent cache write hit (See Table 55 on
page 246 to determine the cache-line states and the access
types following a cache read miss and cache write hit).
During write allocates, a 32-byte burst read cycle is executed in
place of a non-burst write cycle. While the burst read cycle 
generally takes longer to execute than the write cycle, 
performance gains are realized on subsequent write cycle hits 
to the write-allocated cache line. Due to the nature of software, 
memory accesses tend to occur in proximity of each other 
(principle of locality). The likelihood of additional write hits to 
the write-allocated cache line is high. 

The following is a description of three mechanisms by which the 
AMD-K6 3D processor performs write allocations. A write 
allocate is performed when any one or more of these 
mechanisms indicates that a pending write is to a cacheable 
area of memory. 

Write to a Cacheable Page 

Every time the processor performs a cache line fill, the address 
of the page in which the cache line resides is saved in the 
Cacheability Control Register (CCR). The page address of 
subsequent write cycles is compared with the page address 
stored in the CCR. If the two addresses are equal, then the 
processor performs a write allocate because the page has 
already been determined to be cacheable. 

When the processor performs a cache line fill from a different 
page than the address saved in the CCR, the CCR is updated 
with the new page address. 
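
The page comparison described above can be sketched as a small model. This is only an illustration, not the processor's implementation: the class and method names are hypothetical, and the 4-Kbyte page size is an assumption (the standard x86 page size) rather than something stated in this section.

```python
PAGE_SHIFT = 12  # assumption: 4-Kbyte pages, so the page address is address >> 12


class CacheabilityControl:
    """Toy model of the CCR: remembers the page of the most recent line fill."""

    def __init__(self):
        self.page = None  # no cache line fill has been recorded yet

    def record_line_fill(self, address):
        # Every line fill saves (or overwrites) the page address in the CCR.
        self.page = address >> PAGE_SHIFT

    def write_hits_cacheable_page(self, address):
        # A pending write to the same page is known to be cacheable.
        return self.page is not None and (address >> PAGE_SHIFT) == self.page
```

A write to any address in the page of the last line fill returns True; a write to a neighboring page returns False until a line fill from that page updates the stored page address.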

Write to a Sector 

If the address of a pending write cycle matches the tag address 
of a valid cache sector, but the addressed cache line within the 
sector is marked invalid (a sector hit but a cache line miss), 
then the processor performs a write allocate. The pending write 
cycle is determined to be cacheable because the sector hit 
indicates the presence of at least one valid cache line in the 
sector. The two cache lines within a sector are guaranteed by 
design to be within the same page. 

Write Allocate Limit



The Write Handling Control Register (WHCR) is an MSR that
contains three fields: the WCDE bit, the Write Allocate
Enable Limit (WAELIM) field, and the Write Allocate Enable
15-to-16-Mbyte (WAE15M) bit (see Figure 78).

For proper functionality, always program the WCDE bit to 0. 

Symbol    Description                             Bits
WCDE      Always program to 0                     8
WAELIM    Write Allocate Enable Limit             7-1
WAE15M    Write Allocate Enable 15-to-16-Mbyte    0

Note: Hardware RESET initializes this MSR to all zeros.

Figure 78. Write Handling Control Register (WHCR)



The WAELIM field is 7 bits wide. This field, multiplied by 4 
Mbytes, defines an upper memory limit. Any pending write 
cycle that addresses memory below this limit causes the 
processor to perform a write allocate. Write allocate is disabled 
for memory accesses at and above this limit unless the 
processor determines a pending write cycle is cacheable by 
means of one of the other write allocate mechanisms — Write to 
a Cacheable Page and Write to a Sector. The maximum value of 
this memory limit is ((2^7 - 1) × 4 Mbytes) = 508 Mbytes. When all
the bits in this field are set to 0, all memory is above this limit 
and this mechanism for allowing write allocate is effectively 
disabled. 

The Write Allocate Enable 15-to-16-Mbyte (WAE15M) bit is 
used to enable write allocations for the memory write cycles 
that address the 1 Mbyte of memory between 15 Mbytes and 16 
Mbytes. This bit must be set to 1 to allow write allocate in this 
memory area. This bit is provided to account for a small number 
of uncommon memory-mapped I/O adapters that use this 
particular memory address space. If the system contains one of
these peripherals, the bit should be set to 0. The WAE15M bit is
ignored if the value in the WAELIM field is set to less than 16
Mbytes. 
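
As an illustration of the WHCR field layout, the following sketch splits a register value into its three fields and computes the limit. The function names are hypothetical; the bit positions are taken from Figure 78, and the code does not show how the MSR itself is read or written.

```python
MBYTE = 1 << 20


def decode_whcr(whcr):
    """Split a WHCR value into the three fields shown in Figure 78."""
    return {
        "wcde": (whcr >> 8) & 0x1,     # bit 8, always programmed to 0
        "waelim": (whcr >> 1) & 0x7F,  # bits 7-1, in units of 4 Mbytes
        "wae15m": whcr & 0x1,          # bit 0
    }


def write_allocate_limit(whcr):
    """Upper memory limit in bytes; 0 means the limit mechanism is disabled."""
    return decode_whcr(whcr)["waelim"] * 4 * MBYTE
```

With all seven WAELIM bits set, the limit evaluates to 508 Mbytes, which matches the maximum stated above.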

By definition a write allocate is never performed in the memory 
area between 640 Kbytes and 1 Mbyte unless the processor 
determines a pending write cycle is cacheable by means of one 
of the other write allocate mechanisms — Write to a Cacheable 
Page and Write to a Sector. It is not considered safe to perform 
write allocations between 640 Kbytes and 1 Mbyte (000A_0000h
to 000F_FFFFh) because it is considered a noncacheable region
of memory. 

Figure 79 shows the logic flow for all the mechanisms involved 
with write allocate for memory bus cycles. The left side of the 
diagram (the text) describes the conditions that need to be true 
in order for the value of that line to be a 1. Items 1 to 3 of the 
diagram are related to general cache operation and items 4 to 
11 are related to the write allocate mechanisms. 

For more information about write allocate, see the 
Implementation of Write Allocate in the K86™ Processors 
Application Note, document number 21326.

Figure 79. Write Allocate Logic Mechanisms and Conditions






M Cache Organization 

Descriptions of the Logic Mechanisms and Conditions

1. CD Bit of CR0: When the cache disable (CD) bit within
control register 0 (CR0) is set to 1, the cache fill mechanism
for both reads and writes is disabled, therefore write 
allocate does not occur. 

2. PCD Signal— When the PCD (page cache disable) signal is 
driven High, caching for that page is disabled even if KEN# 
is sampled asserted, therefore write allocate does not occur. 

3. CI Bit of TR12— When the cache inhibit bit of Test Register 
12 is set to 1, the L1 caches are disabled, therefore write
allocate does not occur. 

4. Write to a Cacheable Page (CCR)—A write allocate is 
performed if the processor knows that a page is cacheable. 
The CCR is used to store the page address of the last cache 
fill for a read miss. See "Write to a Cacheable Page" on page 
241 for a detailed description of this condition. 

5. Write to a Sector — A write allocate is performed if the 
address of a pending write cycle matches the tag address of a 
valid cache sector but the addressed cache line within the 
sector is invalid. See "Write to a Sector" on page 241 for a 
detailed description of this condition. 

6. WCDE Bit: For proper functionality, always program bit 8
of WHCR to 0.

7. Less Than Limit (WAELIM): The write allocate limit
mechanism determines if the memory area being addressed 
is less than the limit set in the WAELIM field of WHCR. If 
the address is less than the limit, write allocate for that 
memory address is performed as long as conditions 9 and 10 
do not prevent write allocate. 

8. Between 640 Kbytes and 1 Mbyte — Write allocate is not 
performed in the memory area between 640 Kbytes and 1 
Mbyte. It is not considered safe to perform write allocations 
between 640 Kbytes and 1 Mbyte (000A_0000h to
000F_FFFFh) because this area of memory is considered a
noncacheable region of memory. 

9. Between 15-16 Mbytes — If the address of a pending write 
cycle is in the 1 Mbyte of memory between 15 Mbytes and 16 
Mbytes, and the WAE15M bit is set to 1, write allocate for 
this cycle is enabled. 

10. Write Allocate Enable 15-16 Mbytes (WAE15M): This
condition is associated with the Write Allocate Limit 
mechanism and affects write allocate only if the limit 
specified by the WAELIM field is greater than or equal to 
16 Mbytes. If the memory address is between 15 Mbytes and 
16 Mbytes, and the WAE15M bit in the WHCR is set to 0, 
write allocate for this cycle is disabled. 
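
The conditions above can be combined into one decision function. The sketch below is an interpretation of the text, not a transcription of Figure 79: the function and parameter names are hypothetical, the page-match and sector-hit results are passed in as booleans, and the WCDE bit is assumed to be programmed to 0 as required.

```python
KBYTE = 1 << 10
MBYTE = 1 << 20


def limit_mechanism_allows(address, waelim, wae15m):
    """Conditions 7 to 10: the Write Allocate Limit mechanism."""
    limit = waelim * 4 * MBYTE           # WAELIM = 0 gives a limit of 0 (disabled)
    if address >= limit:
        return False
    if 640 * KBYTE <= address < 1 * MBYTE:
        return False                     # condition 8: never allowed by the limit
    if 15 * MBYTE <= address < 16 * MBYTE:
        return wae15m == 1               # conditions 9 and 10
    return True


def write_allocate(address, *, cd=0, pcd=0, ci=0,
                   page_match=False, sector_hit=False,
                   waelim=0, wae15m=0):
    """Return True if a pending write to `address` causes a write allocate."""
    if cd or pcd or ci:                  # conditions 1 to 3: caching is disabled
        return False
    return (page_match or sector_hit
            or limit_mechanism_allows(address, waelim, wae15m))
```

Note that a write between 640 Kbytes and 1 Mbyte still allocates when `page_match` or `sector_hit` is true, which mirrors the exception stated for the other two mechanisms.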



Prefetching

The AMD-K6 3D processor performs instruction cache
prefetching for sector replacements only, as opposed to
cache-line replacements. The cache prefetching results in the
filling of the required cache line first, and a prefetch of the
second cache line making up the other half of the sector.
Furthermore, the prefetch of the second cache line is initiated
only in the forward direction; that is, only if the requested
cache line is in the first position within the sector. From the
perspective of the external bus, the two cache-line fills typically
appear as two 32-byte burst read cycles occurring back-to-back
or, if allowed, as pipelined cycles. The burst read cycles do not
occur back-to-back (wait states occur) if the processor is not
ready to start a new cycle, if higher priority data read or write
requests exist, or if NA# (next address) was sampled negated.
Wait states can also exist between burst cycles if the processor
samples AHOLD or BOFF# asserted.
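
The forward-only prefetch rule can be illustrated with a short sketch. It assumes the miss causes a sector replacement (the only case in which prefetching occurs) and uses the sizes given in this chapter: 32-byte cache lines and two lines per sector. The function name is hypothetical.

```python
LINE_BYTES = 32                 # one cache line
SECTOR_BYTES = 2 * LINE_BYTES   # a sector holds two cache lines


def instruction_fill_addresses(miss_address):
    """Line addresses filled for an instruction-cache sector replacement."""
    line = miss_address & ~(LINE_BYTES - 1)   # required line is filled first
    fills = [line]
    if line % SECTOR_BYTES == 0:              # first line of the sector:
        fills.append(line + LINE_BYTES)       # prefetch the second line
    return fills
```

A miss in the first line of a sector yields two line fills, while a miss in the second line yields only the required line because the prefetch never runs backward.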

Table 55 shows all the possible cache-line states before and
after program-generated accesses to individual cache lines. The
table includes the correspondence between MESI states and
writethrough or writeback states for lines in the data cache.



Table 55. Data Cache States for Read and Write Accesses

Type                      Cache State            Access Type(1)      MESI State              Writeback/Writethrough
                          Before Access                              After Access            State After Access
Cache Read   Read Miss    invalid                single read         invalid                 -
                          invalid                burst read(2)       shared or exclusive(3)  writethrough or writeback(3)
                                                 (cacheable)
             Read Hit     shared                 -                   shared                  writethrough
                          exclusive              -                   exclusive               writeback
                          modified               -                   modified                writeback
Cache Write  Write Miss   invalid                single write(4)     invalid                 -
             Write Hit    shared                 cache update and    shared or exclusive(3)  writethrough or writeback(3)
                                                 single write
                          exclusive or modified  cache update        modified                writeback

Notes:

1. Single read, single write, cache update, and writethrough = 1 to 8 bytes. Line fill = 32-byte burst read.

2. If CACHE# is driven Low and KEN# is sampled asserted.

3. If PWT is driven Low and WB/WT# is sampled High, the line is cached in the exclusive (writeback) state.

4. A write cycle occurs only if the write allocate conditions as specified in "Write Allocate" on page 240 are not met.

- Not applicable or none

Cache Coherency



Different ways exist to maintain coherency between the system 
memory and cache memories. Inquire cycles, internal snoops, 
FLUSH#, WBINVD, INVD, and line replacements all prevent 
inconsistencies between memories. 

Inquire Cycles 

Inquire cycles are bus cycles initiated by system logic. These 
inquiries ensure coherency between the caches and main 
memory. In systems with multiple caching masters, system logic 
maintains cache coherency by driving inquire cycles to the 
processor. System logic initiates inquire cycles by asserting 
AHOLD, BOFF#, or HOLD to obtain control of the address bus 
and then driving EADS#, INV (optional), and an inquire 
address (A[31:5]). This type of bus cycle causes the processor to 
compare the tags for both its instruction and data caches with 
the inquire address. If there is a hit to a shared or exclusive line 
in the data cache or a valid line in the instruction cache, the 
processor asserts HIT#. If the compare hits a modified line in 
the data cache, the processor asserts HIT# and HITM#. If 
HITM# is asserted, the processor writes the modified line back 
to memory. If INV was sampled asserted with EADS#, a hit 
invalidates the line. If INV was sampled negated with EADS#, a
hit leaves the line in the shared state or transitions it from the 
exclusive or modified to shared state. 
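
The data-cache response to an inquire cycle can be summarized as a small state function. This is a sketch of the behavior described above, with a hypothetical function name and MESI states written as single letters; it models one data-cache line only.

```python
def inquire_cycle(state, inv):
    """Model a data-cache line's response to an inquire cycle.

    state: 'M', 'E', 'S', or 'I'.  inv: True if INV is sampled asserted.
    Returns (hit, hitm, writeback, new_state).
    """
    if state == "I":
        return (False, False, False, "I")   # miss: no signals, no change
    hitm = state == "M"                     # modified line asserts HITM# too
    new_state = "I" if inv else "S"         # INV invalidates, else shared
    return (True, hitm, hitm, new_state)    # a HITM# hit is written back
```

For example, a hit on a modified line with INV negated asserts both HIT# and HITM#, writes the line back, and leaves it in the shared state.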



Internal Snooping 

Internal snooping is initiated by the processor (rather than 
system logic) during certain cache accesses. It is used to 
maintain coherency between the L1 instruction and data
caches. 

The processor automatically snoops its instruction cache during 
read or write misses to its data cache, and it snoops its data 
cache during read misses to its instruction cache. Table 56 on 
page 249 summarizes the actions taken during this internal 
snooping. 

If an internal snoop hits its target, the processor does the
following:

■ Data cache snoop during an instruction-cache read miss: If
modified, the line in the data cache is written back to 
memory. Regardless of its state, the data-cache line is 
invalidated and the instruction cache performs a burst cycle 
read from memory. 

■ Instruction cache snoop during a data cache miss — The line in 
the instruction cache is marked invalid, and the data-cache 
read or write is performed from memory. 



FLUSH#

In response to sampling FLUSH# asserted, the processor writes
back any data cache lines that are in the modified state and
then marks all lines in the instruction and data caches as
invalid.

WBINVD and INVD

These x86 instructions cause all cache lines to be marked as
invalid. WBINVD writes back modified lines before marking all
cache lines invalid. INVD does not write back modified lines.

Cache-Line Replacement

Replacing lines in the instruction or data cache, according to
the line replacement algorithms described in "Cache-Line
Fills" on page 239, ensures coherency between main memory
and the caches.

Table 56 on page 249 shows all possible cache-line states before
and after cache snoop or invalidation operations performed
with inquire cycles. This table shows all of the conditions for
writethroughs and writebacks to memory.

Table 56. Cache States for Inquiries, Snoops, Invalidation, and Replacement

Type of Operation       Cache State           Memory Access             MESI State          Writeback/Writethrough
                        Before Operation                                After Operation     State After Operation
Inquire Cycle           shared or exclusive   -                         INV=0: shared       writethrough
                                                                        INV=1: invalid      invalid
                        modified              burst write (writeback)   INV=0: shared       writethrough
                                                                        INV=1: invalid      invalid
Internal Snoop          shared or exclusive   -                         invalid             invalid
                        modified              burst write (writeback)   invalid             invalid
FLUSH# Signal           shared or exclusive   -                         invalid             invalid
                        modified              burst write (writeback)   invalid             invalid
WBINVD Instruction      shared or exclusive   -                         invalid             invalid
                        modified              burst write (writeback)   invalid             invalid
INVD Instruction        -                     -                         invalid             invalid
Cache-Line Replacement  shared or exclusive   -                         See Table 55 on page 246
                        modified              burst write (writeback)   See Table 55 on page 246

Notes:

All writebacks are 32-byte burst write cycles.
- Not applicable or none.

Cache Snooping

Table 57 shows the conditions under which snooping occurs in
the processor and the resources that are snooped.

Table 57. Snoop Action

Type of Event    Type of Access                  Snooping Action
                                                 Instruction Cache   Data Cache
Inquire Cycle    System Logic                    yes(1)              yes(1)
Internal Snoop   Instruction Cache  Read Miss    -                   yes(2)
                                    Read Hit     -                   no
                 Data Cache         Read Miss    yes(3)              -
                                    Read Hit     no                  -
                                    Write Miss   yes(3)              -
                                    Write Hit    no                  -

Notes:

1. The processor's response to an inquire cycle depends on the state of the INV input signal
and the state of the cache line as follows:

For the instruction cache, if INV is sampled negated, the line remains invalid or valid, but
if INV is sampled asserted, the line is invalidated.

For the data cache, if INV is sampled negated, valid lines remain in or transition to the
shared state, a modified data cache line is written back before the line is marked shared
(with HITM# asserted), and invalid lines remain invalid. For the data cache, if INV is
sampled asserted, the line is marked invalid. Modified lines are written back before
invalidation.

2. If an internal snoop hits a modified line in the data cache, the line is written back and
invalidated. Then the instruction cache performs a burst read from memory.

3. If an internal snoop hits a line in the instruction cache, the instruction cache line is
invalidated and the data-cache read or write is performed from memory.

- Not applicable.







Writethrough vs. Writeback Coherency States



The terms writethrough and writeback apply to two related 
concepts in a read-write cache like the AMD-K6 3D processor 
L1 data cache. The following conditions apply to both the
writethrough and writeback modes: 

■ Memory Writes — A relationship exists between external 
memory writes and their concurrence with cache updates: 

• An external memory write that occurs concurrently with 
a cache update to the same location is a writethrough. 
Writethroughs are driven as single cycles on the bus. 

• An external memory write that occurs after the processor 
has modified a cache line is a writeback. Writebacks are 
driven as burst cycles on the bus. 

■ Coherency State — A relationship exists between MESI 
coherency states and writethrough-writeback coherency 
states of lines in the cache as follows: 

• Shared MESI lines are in the writethrough state. 

• Modified and exclusive MESI lines are in the writeback 
state. 

A20M# Masking of Cache Accesses



Although the processor samples A20M# as a level-sensitive 
input on every clock edge, it should only be asserted in Real 
mode. The processor applies the A20M# masking to its tags, 
through which all programs access the caches. Therefore, assertion
of A20M# affects all addresses (cache and external memory),
including the following:

■ Cache-line fills (caused by read misses) 

■ Cache writethroughs (caused by write misses or write hits to
lines in the shared state) 

However, A20M# does not mask writebacks or invalidations 
caused by the following actions: 

■ Internal snoops 

■ Inquire cycles 

■ The FLUSH# signal 

■ The WBINVD instruction 
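
As a brief illustration of what the masking does to an address, the sketch below forces address bit 20 to 0 when A20M# is asserted. That bit-level effect is the conventional x86 meaning of A20M# and is an assumption here, since this section does not spell it out; the function name is hypothetical.

```python
A20_MASK = ~(1 << 20) & 0xFFFFFFFF  # all address bits except bit 20


def apply_a20m(address, a20m_asserted):
    """Return the address used for tag compares and external cycles."""
    return address & A20_MASK if a20m_asserted else address
```

With A20M# asserted, an access just above 1 Mbyte wraps to the bottom of memory, which emulates the 1-Mbyte address wraparound of early Real-mode systems.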

Floating-Point and
Multimedia Execution Units 

Floating-Point Execution Unit 



The AMD-K6 3D processor contains an IEEE 754-compatible 
and 854-compatible floating-point execution unit designed to 
accelerate the performance of software that utilizes the x86 
floating-point instruction set. Floating-point software is 
typically written to manipulate numbers that are very large or 
very small, that require a high degree of precision, or that result 
from complex mathematical operations such as 
transcendentals. Applications that take advantage of 
floating-point operations include geometric calculations for 
graphics acceleration, scientific, statistical, and engineering 
applications, and business applications that use large amounts 
of high-precision data. 

The high-performance floating-point execution unit contains an 
adder unit, a multiplier unit, and a divide/square root unit. 
These low-latency units can execute floating-point instructions 
in as few as two processor clocks. To increase performance, the 
processor is designed to simultaneously decode most 
floating-point instructions with most short-decodeable 
instructions. 

See Chapter 3, "Software Environment" on page 23 for a 
description of the floating-point data types, registers, and 
instructions. 

Handling Floating-Point Exceptions

The processor provides the following two types of exception 
handling for floating-point exceptions: 

■ If the numeric error (NE) bit in CR0 is set to 1, the processor
invokes the interrupt 10h handler. In this manner, the
floating-point exception is completely handled by software.

■ If the NE bit in CR0 is set to 0, the processor requires
external logic to generate an interrupt on the INTR signal in 
order to handle the exception. 

External Logic Support of Floating-Point Exceptions 

The processor provides the FERR# (Floating-Point Error) and 
IGNNE# (Ignore Numeric Error) signals to allow the external 
logic to generate the interrupt in a manner consistent with 
IBM-compatible PC/AT systems. The assertion of FERR# 
indicates the occurrence of an unmasked floating-point 
exception resulting from the execution of a floating-point 
instruction. IGNNE# is used by the external hardware to control 
the effect of an unmasked floating-point exception. Under 
certain circumstances, if IGNNE# is sampled asserted, the 
processor ignores the floating-point exception. 

Figure 80 on page 255 illustrates an implementation of external
logic for supporting floating-point exceptions. The following 
example explains the operation of the external logic in Figure 
80: 

As the result of a floating-point exception, the processor 
asserts FERR#. The assertion of FERR# and the 
sampling of IGNNE# negated indicates the processor has 
stopped instruction execution and is waiting for an 
interrupt. The assertion of FERR# leads to the assertion 
of INTR by the interrupt controller. The processor 
acknowledges the interrupt and jumps to the 
corresponding interrupt service routine in which an I/O 
write cycle to address port F0h leads to the assertion of
IGNNE#. When IGNNE# is sampled asserted, the 
processor ignores the floating-point exception and 
continues instruction execution. When the processor 
negates FERR#, the external logic negates IGNNE#. 



See "FERR# (Floating-Point Error)" on page 158 and "IGNNE# 
(Ignore Numeric Exception)" on page 163 for more details. 

Figure 80. External Logic for Supporting Floating-Point Exceptions

Multimedia and 3D Execution Units


The multimedia and 3D execution units of the processor are 
designed to accelerate the performance of software written 
using the industry-standard MMX instructions and the new 3D 
instructions. Applications that can take advantage of the MMX 
and 3D instructions include graphics, video and audio 
compression and decompression, speech recognition, and 
telephony applications. 

The MMX multimedia execution unit can execute MMX 
instructions in a single processor clock. All MMX and 3D 
arithmetic instructions are pipelined for higher performance. 
To increase performance, the processor is designed to 
simultaneously decode all MMX and 3D instructions with most 
other instructions. 

For more information on MMX instructions, see Appendix A,
"MMX Multimedia Technology" on page 347. For more
information on 3D instructions, see Chapter 4, "3D Technology"
on page 81. 

Floating-Point and MMX/3D Instruction Compatibility

Registers

The eight 64-bit MMX registers (which are also utilized by 3D
instructions) are mapped on the floating-point stack. This
enables backward compatibility with all existing software. For 
example, the register saving event that is performed by 
operating systems during task switching requires no changes to 
the operating system. The same support provided in an 
operating system's interrupt 7 handler (Device Not Available) 
for saving and restoring the floating-point registers also 
supports saving and restoring the MMX registers. 

Exceptions 

There are no new exceptions defined for supporting the MMX 
and 3D instructions. All exceptions that occur while decoding 
or executing an MMX or 3D instruction are handled in existing 
exception handlers without modification. See "3D Exceptions" 
on page 93 for more information. 

FERR# and IGNNE#

MMX instructions and 3D instructions do not generate 
floating-point exceptions. However, if an unmasked 
floating-point exception is pending, the processor asserts 
FERR# at the instruction boundary of the next floating-point 
instruction, MMX instruction, 3D instruction or WAIT 
instruction. 

The sampling of IGNNE# asserted only affects processor 
operation during the execution of an error-sensitive 
floating-point instruction, MMX instruction, 3D instruction or 
WAIT instruction when the NE bit in CR0 is set to 0.

System Management Mode
(SMM) 

Overview 



SMM is an alternate operating mode entered by way of a system 
management interrupt (SMI#) and handled by an interrupt 
service routine. SMM is designed for system control activities 
such as power management. These activities appear 
transparent to conventional operating systems like DOS and 
Windows. SMM is primarily targeted for use by the Basic Input 
Output System (BIOS) and specialized low-level device drivers. 
The code and data for SMM are stored in the SMM memory 
area, which is isolated from main memory. 

The processor enters SMM by the system logic's assertion of the 
SMI# interrupt and the processor's acknowledgment by the 
assertion of SMIACT#. At this point the processor saves its state 
into the SMM memory state-save area and jumps to the SMM 
service routine. The processor returns from SMM when it 
executes the RSM (resume) instruction from within the SMM 
service routine. Subsequently, the processor restores its state 
from the SMM save area, negates SMIACT#, and resumes 
execution with the instruction following the point where it 
entered SMM. 

The following sections summarize the SMM state-save area, 
entry into and exit from SMM, exceptions and interrupts in 
SMM, memory allocation and addressing in SMM, and the SMI# 
and SMIACT# signals. 

SMM Operating Mode and Default Register Values



The software environment within SMM has the following 
characteristics: 

■ Addressing and operation in Real mode

■ 4-Gbyte segment limits

■ Default 16-bit operand, address, and stack sizes, although
instruction prefixes can override these defaults

■ Control transfers that do not override the default operand
size truncate the EIP to 16 bits

■ Far jumps or calls cannot transfer control to a segment with
a base address requiring more than 20 bits, as in Real mode
segment-base addressing

■ A20M# is masked

■ Interrupt vectors use the Real-mode interrupt vector table

■ The IF flag in EFLAGS is cleared (INTR not recognized)

■ The TF flag in EFLAGS is cleared

■ The NMI and INIT interrupts are disabled

■ Debug register DR7 is cleared (debug traps disabled)

Figure 81 on page 259 shows the default map of the SMM
memory area. It consists of a 64-Kbyte area, between
0003_0000h and 0003_FFFFh, of which the top 32 Kbytes
(0003_8000h to 0003_FFFFh) must be populated with RAM.
The default code-segment (CS) base address for the area,
called the SMM base address, is at 0003_0000h. The top 512
bytes (0003_FE00h to 0003_FFFFh) contain a fill-down SMM
state-save area. The default entry point for the SMM service
routine is 0003_8000h.
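
The default SMM memory map can be expressed relative to the SMM base address. The sketch below uses a hypothetical function name and simply restates the offsets given in the text (entry point at base + 8000h, state-save area from base + FE00h to base + FFFFh).

```python
DEFAULT_SMM_BASE = 0x00030000


def smm_layout(smm_base=DEFAULT_SMM_BASE):
    """Key addresses of the 64-Kbyte SMM memory area for a given base."""
    return {
        "base": smm_base,
        "entry_point": smm_base + 0x8000,        # SMM service routine entry
        "state_save_bottom": smm_base + 0xFE00,  # lowest state-save byte
        "state_save_top": smm_base + 0xFFFF,     # save starts here, fills down
    }
```

For the default base, the computed values reproduce the addresses quoted above, and the state-save area spans exactly 512 bytes.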

Figure 81. SMM Memory

    0003_FFFFh    Top of SMM state-save area (fills down)
    0003_FE00h    Bottom of SMM state-save area
    0003_8000h    Service routine entry point (start of 32-Kbyte minimum RAM)
    0003_0000h    SMM base address (CS)


Table 58 shows the initial state of registers when entering SMM.

Table 58. Initial State of Registers in SMM

Registers                      SMM Initial State
General Purpose Registers      unmodified
EFLAGS                         0000_0002h
CR0                            PE, EM, TS, and PG are cleared (bits 0, 2, 3,
                               and 31). The other bits are unmodified.
DR7                            0000_0400h
GDTR, LDTR, IDTR, TSSR, DR6    unmodified
EIP                            0000_8000h
CS                             0003_0000h
DS, ES, FS, GS, SS             0000_0000h

SMM State-Save Area



When the processor acknowledges an SMI# interrupt by 
asserting SMIACT#, it saves its state in a 512-byte SMM 
state-save area shown in Table 59. The save begins at the top of 
the SMM memory area (SMM base address + FFFFh) and fills 
down to SMM base address + FE00h.

Table 59 shows the offsets in the SMM state-save area relative 
to the SMM base address. The SMM service routine can alter 
any of the read/write values in the state-save area. 



Table 59. SMM State-Save Area Map

Address Offset    Contents Saved
FFFCh             CR0
FFF8h             CR3
FFF4h             EFLAGS
FFF0h             EIP
FFECh             EDI
FFE8h             ESI
FFE4h             EBP
FFE0h             ESP
FFDCh             EBX
FFD8h             EDX
FFD4h             ECX
FFD0h             EAX
FFCCh             DR6
FFC8h             DR7
FFC4h             TR
FFC0h             LDTR Base
FFBCh             GS
FFB8h             FS
FFB4h             DS
FFB0h             SS
FFACh             CS

Notes:

- No data dump at that address

* Only contains information if SMI# is asserted during a valid I/O bus cycle.

Table 59. SMM State-Save Area Map (continued)

Address Offset    Contents Saved
FFA8h             ES
FFA4h             I/O Trap Dword
FFA0h             -
FF9Ch             I/O Trap EIP*
FF98h             -
FF94h             -
FF90h             IDT Base
FF8Ch             IDT Limit
FF88h             GDT Base
FF84h             GDT Limit
FF80h             TSS Attr
FF7Ch             TSS Base
FF78h             TSS Limit
FF74h             -
FF70h             LDT High
FF6Ch             LDT Low
FF68h             GS Attr
FF64h             GS Base
FF60h             GS Limit
FF5Ch             FS Attr
FF58h             FS Base
FF54h             FS Limit
FF50h             DS Attr
FF4Ch             DS Base
FF48h             DS Limit
FF44h             SS Attr
FF40h             SS Base
FF3Ch             SS Limit
FF38h             CS Attr
FF34h             CS Base
FF30h             CS Limit

Notes:

- No data dump at that address

* Only contains information if SMI# is asserted during a valid I/O bus cycle.

Table 59. SMM State-Save Area Map (continued)

Address Offset    Contents Saved
FF2Ch             ES Attr
FF28h             ES Base
FF24h             ES Limit
FF20h             -
FF1Ch             -
FF18h             -
FF14h             CR2
FF10h             CR4
FF0Ch             I/O Restart ESI*
FF08h             I/O Restart ECX*
FF04h             I/O Restart EDI*
FF02h             HALT Restart Slot
FF00h             I/O Trap Restart Slot
FEFCh             SMM RevID
FEF8h             SMM Base
FEF7h-FE00h       -

Notes:

- No data dump at that address

* Only contains information if SMI# is asserted during a valid I/O bus cycle.



SMM Revision Identifier 



The SMM revision identifier at offset FEFCh in the SMM 
state-save area specifies the version of SMM and the extensions 
that are available on the processor. The SMM revision identifier 
fields are as follows: 

■ Bits 31-18: Reserved

■ Bit 17: SMM base address relocation (1 = enabled)

■ Bit 16: I/O trap restart (1 = enabled)

■ Bits 15-0: SMM revision level for the AMD-K6 3D processor
= 0002h

Table 60 shows the format of the SMM Revision Identifier.

Table 60. SMM Revision Identifier

31-18       17                     16                    15-0
Reserved    SMM Base Relocation    I/O Trap Extension    SMM Revision Level
0           1                      1                     0002h
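
A handler that wants to check these fields could decode the identifier as in the following sketch. The function name and dictionary keys are hypothetical; the bit positions are those listed above.

```python
def decode_smm_revid(revid):
    """Split the 32-bit SMM revision identifier into its fields."""
    return {
        "base_relocation": bool((revid >> 17) & 1),  # bit 17
        "io_trap_restart": bool((revid >> 16) & 1),  # bit 16
        "revision_level": revid & 0xFFFF,            # bits 15-0
    }
```

The values in Table 60 correspond to the dword 0003_0002h, which decodes to both extensions enabled and revision level 0002h.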



SMM Base Address 



During RESET, the processor sets the base address of the
code-segment (CS) for the SMM memory area, the SMM base
address, to its default, 0003_0000h. The SMM base address at
offset FEF8h in the SMM state-save area can be changed by the 
SMM service routine to any address that is aligned to a 
32-Kbyte boundary. (Locations not aligned to a 32-Kbyte 
boundary cause the processor to enter the Shutdown state when 
executing the RSM instruction.) 

In some operating environments it may be desirable to relocate 
the 64-Kbyte SMM memory area to a high memory area in order 
to provide more low memory for legacy software. During system 
initialization, the base of the 64-Kbyte SMM memory area is 
relocated by the BIOS. To relocate the SMM base address, the 
system enters the SMM handler at the default address. This 
handler changes the SMM base address location in the SMM
state-save area, copies the SMM handler to the new location,
and exits SMM. 

The next time SMM is entered, the processor saves its state at 
the new base address. This new address is used for every SMM 
entry until the SMM base address in the SMM state-save area is 
changed or a hardware reset occurs. 
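
The relocation steps can be sketched against a simple memory model. Everything here is illustrative: `memory` is a hypothetical bytearray standing in for physical memory, the function name is invented, and the sketch assumes that the entry point stays at base + 8000h after relocation, as it is for the default base.

```python
import struct

SMM_BASE_SLOT = 0xFEF8      # offset of the SMM base field in the state-save area
ENTRY_OFFSET = 0x8000       # assumption: entry point remains at base + 8000h
ALIGNMENT = 32 * 1024       # new base must sit on a 32-Kbyte boundary


def relocate_smm_base(memory, current_base, new_base, handler_size):
    """Write the new SMM base and copy the handler to its new entry point."""
    if new_base % ALIGNMENT != 0:
        # A misaligned base would lead to Shutdown when RSM executes.
        raise ValueError("SMM base must be aligned to a 32-Kbyte boundary")
    struct.pack_into("<I", memory, current_base + SMM_BASE_SLOT, new_base)
    src = current_base + ENTRY_OFFSET
    dst = new_base + ENTRY_OFFSET
    memory[dst:dst + handler_size] = memory[src:src + handler_size]
```

The alignment check comes first so that a bad address is rejected before the state-save area is modified.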

Halt Restart Slot



During entry into SMM, the halt restart slot at offset FF02h in 
the SMM state-save area indicates if SMM was entered from the 
Halt state. Before returning from SMM, the halt restart slot 
(offset FF02h) can be written to by the SMM service routine to 
specify whether the return from SMM takes the processor back 
to the Halt state or to the next instruction after the HLT 
instruction. 

Upon entry into SMM, the halt restart slot is defined as follows: 

■ Bits 15-1 — Reserved 

■ Bit 0— Point of entry to SMM: 
1 = entered from Halt state 

0 = not entered from Halt state 

After entry into the SMI handler and before returning from 
SMM, the halt restart slot can be written using the following 
definition: 

■ Bits 15-1 — Reserved 

■ Bit 0 — Point of return when exiting from SMM: 

1 = return to Halt state 

0 = return to next instruction after the HLT instruction 

If the return from SMM takes the processor back to the Halt 
state, the HLT instruction is not re-executed, but the Halt 
special bus cycle is driven on the bus after the return. 
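
The two uses of the halt restart slot can be sketched as a pair of helpers. The names are hypothetical and `memory` is a bytearray standing in for physical memory; the slot is treated as a little-endian 16-bit word at offset FF02h, and the reserved bits are preserved on writes.

```python
import struct

HALT_RESTART_OFFSET = 0xFF02


def entered_from_halt(memory, smm_base):
    """On SMM entry: bit 0 = 1 means SMM was entered from the Halt state."""
    (slot,) = struct.unpack_from("<H", memory, smm_base + HALT_RESTART_OFFSET)
    return bool(slot & 1)


def set_halt_return(memory, smm_base, return_to_halt):
    """Before RSM: choose Halt state (True) or the instruction after HLT."""
    (slot,) = struct.unpack_from("<H", memory, smm_base + HALT_RESTART_OFFSET)
    slot = (slot & 0xFFFE) | int(return_to_halt)   # keep reserved bits 15-1
    struct.pack_into("<H", memory, smm_base + HALT_RESTART_OFFSET, slot)
```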

I/O Trap Dword

If the assertion of SMI# is recognized during the execution of an
I/O instruction, the I/O trap dword at offset FFA4h in the SMM 
state-save area contains information about the instruction. The 
fields of the I/O trap dword are configured as follows: 

■ Bits 31-16 — I/O port address 

■ Bits 15-4 — Reserved 

■ Bit 3 — REP (repeat) string operation (1 = REP string, 0 = not 
a REP string) 

■ Bit 2 — I/O string operation (1 = I/O string, 0 = not an I/O 
string) 

■ Bit 1 — Valid I/O instruction (1 = valid, 0 = invalid) 

■ Bit 0 — Input or output instruction (1 = INx, 0 = OUTx) 

Table 61 shows the format of the I/O trap dword. 



Table 61. I/O Trap Dword Configuration 

Bits 31-16         Bits 15-4   Bit 3        Bit 2        Bit 1         Bit 0 
I/O Port Address   Reserved    REP String   I/O String   Valid I/O     Input or 
                               Operation    Operation    Instruction   Output 


The I/O trap dword is related to the I/O trap restart slot (see "I/O 
Trap Restart Slot" on page 266). If bit 1 of the I/O trap dword is 
set by the processor, it means that SMI# was asserted during the 
execution of an I/O instruction. The SMI handler tests bit 1 to 
see if there is a valid I/O instruction trapped. If the I/O 
instruction is valid, the SMI handler is required to ensure the 
I/O trap restart slot is set properly. The I/O trap restart slot 
informs the processor whether it should re-execute the I/O 
instruction after the RSM or execute the instruction following 
the trapped I/O instruction. 
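
The field layout of the I/O trap dword can be illustrated with a small C decoder. This is a sketch under stated assumptions: the dword is taken to have already been read from offset FFA4h into a `uint32_t`, and the `io_trap_info` structure and `decode_io_trap_dword` function are hypothetical names introduced here for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the I/O trap dword (field names are illustrative). */
typedef struct {
    uint16_t port;       /* bits 31-16: I/O port address */
    bool     rep_string; /* bit 3: 1 = REP string operation */
    bool     io_string;  /* bit 2: 1 = I/O string operation */
    bool     valid;      /* bit 1: 1 = valid I/O instruction trapped */
    bool     is_input;   /* bit 0: 1 = INx, 0 = OUTx */
} io_trap_info;

/* Split the 32-bit I/O trap dword into its documented fields.
 * Bits 15-4 are reserved and are ignored. */
io_trap_info decode_io_trap_dword(uint32_t dword)
{
    io_trap_info info;

    info.port       = (uint16_t)(dword >> 16);
    info.rep_string = ((dword >> 3) & 1u) != 0;
    info.io_string  = ((dword >> 2) & 1u) != 0;
    info.valid      = ((dword >> 1) & 1u) != 0;
    info.is_input   = (dword & 1u) != 0;
    return info;
}
```

An SMI handler would typically inspect `valid` before looking at any other field, since the remaining fields describe the trapped instruction only when bit 1 is set.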

Note: If SMI# is sampled asserted during an I/O bus cycle a 
minimum of three clock edges before BRDY# is sampled 
asserted, the associated I/O instruction is guaranteed to be 
trapped by the SMI handler. 
The I/O trap restart slot at offset FF00h in the SMM state-save 
area specifies whether the trapped I/O instruction should be 
re-executed on return from SMM. This slot in the state-save area 
is called the I/O instruction restart function. Re-executing a 
trapped I/O instruction is useful, for example, if an I/O write 
occurs to a disk that is powered down. The system logic 
monitoring such an access can assert SMI#. Then the SMM 
service routine would query the system logic, detect a failed I/O 
write, take action to power-up the I/O device, enable the I/O 
trap restart slot feature, and return from SMM. 

The fields of the I/O trap restart slot are defined as follows: 

■ Bits 31-16 — Reserved 

■ Bits 15-0 — I/O instruction restart on return from SMM: 

0000h = execute the next instruction after the trapped 
I/O instruction 

00FFh = re-execute the trapped I/O instruction 

Table 62 shows the format of the I/O trap restart slot. 

Table 62. I/O Trap Restart Slot 

Bits 31-16   Bits 15-0 
Reserved     I/O instruction restart on return from SMM: 
             ■ 0000h = execute the next instruction after the trapped I/O instruction 
             ■ 00FFh = re-execute the trapped I/O instruction 



The processor initializes the I/O trap restart slot to 0000h upon 
entry into SMM. If SMM was entered due to a trapped I/O 
instruction, the processor indicates the validity of the I/O 
instruction by setting or clearing bit 1 of the I/O trap dword at 
offset FFA4h in the SMM state-save area. The SMM service 
routine should test bit 1 of the I/O trap dword to determine if a 
valid I/O instruction was being executed when entering SMM 
and before writing the I/O trap restart slot. If the I/O instruction 
is valid, the SMM service routine can safely rewrite the I/O trap 
restart slot with the value 00FFh, which causes the processor to 
re-execute the trapped I/O instruction when the RSM 
instruction is executed. If the I/O instruction is invalid, writing 
the I/O trap restart slot has undefined results. 
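
The rule above (test bit 1 of the I/O trap dword first, and write the restart slot only when the trapped instruction is valid) can be sketched as follows. This is an illustration, not a handler implementation: the function name `request_io_reexecute` is hypothetical, and the caller is assumed to pass the dword read from offset FFA4h and a pointer to the 16-bit restart field (bits 15-0 of the slot at offset FF00h).

```c
#include <stdbool.h>
#include <stdint.h>

#define IO_TRAP_VALID_BIT    (1u << 1) /* bit 1 of the I/O trap dword */
#define IO_RESTART_REEXECUTE 0x00FFu   /* re-execute the trapped I/O instruction */

/* Request re-execution of the trapped I/O instruction on RSM.
 * Returns false and leaves the field untouched when no valid I/O
 * instruction was trapped, because writing the slot in that case
 * has undefined results. */
bool request_io_reexecute(uint32_t io_trap_dword, uint16_t *restart_field)
{
    if ((io_trap_dword & IO_TRAP_VALID_BIT) == 0)
        return false;

    *restart_field = IO_RESTART_REEXECUTE;
    return true;
}
```

Leaving the field alone in the invalid case keeps the value 0000h that the processor stored on SMM entry, so execution continues with the next instruction.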
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If a second SMI# is asserted and a valid I/O instruction was 
trapped by the first SMM handler, the processor services the 
second SMI# prior to re-executing the trapped I/O instruction. 
The second entry into SMM never has bit 1 of the I/O trap dword 
set, and the second SMM service routine must not rewrite the 
I/O trap restart slot. 

During a simultaneous SMI# I/O instruction trap and debug 
breakpoint trap, the AMD-K6 3D processor first responds to the 
SMI# and postpones recognizing the debug exception until 
after returning from SMM via the RSM instruction. If the debug 
registers DR3-DR0 are used while in SMM, they must be saved 
and restored by the SMM handler. The processor automatically 
saves and restores DR7-DR6. If the I/O trap restart slot in the 
SMM state-save area contains the value 00FFh when the RSM 
instruction is executed, the debug trap does not occur until 
after the I/O instruction is re-executed. 



Exceptions, Interrupts, and Debug in SMM 



During an SMI# I/O trap, the exception/interrupt priority of the 
processor changes from its normal priority. The normal priority 
places the debug traps at a priority higher than the sampling of 
the FLUSH# or SMI# signals. However, during an SMI# I/O trap, 
the sampling of the FLUSH# or SMI# signals takes precedence 
over debug traps. 

The processor recognizes the assertion of NMI within SMM 
immediately after the completion of an IRET instruction. Once 
NMI is recognized within SMM, NMI recognition remains 
enabled until SMM is exited, at which point NMI masking is 
restored to the state it was in before entering SMM. 
11 

Test and Debug 



The AMD-K6 3D processor implements various test and debug 
modes to enable the functional and manufacturing testing of 
systems and boards that use the processor. In addition, the 
debug features of the processor allow designers to debug the 
instruction execution of software components. This chapter 
describes the following test and debug features: 

■ Built-in Self-Test (BIST)— The BIST, which is invoked after 
the falling transition of RESET, runs internal tests that 
exercise most on-chip RAM structures. 

■ Tri-State Test Mode — A test mode that causes the processor 
to float its output and bidirectional pins. 

■ Boundary-Scan Test Access Port (TAP) —The Joint Test Action 
Group (JTAG) test access function defined by the IEEE 
Standard Test Access Port and Boundary-Scan Architecture 
(IEEE 1149.1-1990) specification. 

■ Level-One (L1) Cache Inhibit — A feature that disables the 
processor's internal L1 instruction and data caches. 

■ Debug Support — Consists of all x86-compatible software 
debug features, including the debug extensions. 
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Following the falling transition of RESET, the processor 
unconditionally runs its BIST. The internal resources tested 
during BIST include the following: 

■ L1 instruction and data caches 

■ Instruction and Data Translation Lookaside Buffers (TLBs) 

The contents of the EAX general-purpose register after the 
completion of reset indicate if the BIST was successful. If EAX 
contains 0000_0000h, then BIST was successful. If EAX is 
non-zero, the BIST failed. Following the completion of the BIST, 
the processor jumps to address FFFF_FFF0h to start 
instruction execution, regardless of the outcome of the BIST. 

The BIST takes approximately 295,000 processor clocks to 
complete. 

Tri-State Test Mode 



The Tri-State Test mode causes the processor to float its output 
and bidirectional pins, which is useful for board-level 
manufacturing testing. In this mode, the processor is 
electrically isolated from other components on a system board, 
allowing automated test equipment (ATE) to test components 
that drive the same signals as those the processor floats. 

If the FLUSH# signal is sampled Low during the falling 
transition of RESET, the processor enters the Tri-State Test 
mode. (See "FLUSH# (Cache Flush)" on page 159 for the 
specific sampling requirements.) The signals floated in the 
Tri-State Test mode are as follows: 



■ A[31:3] 
■ D/C# 
■ M/IO# 
■ ADS# 
■ D[63:0] 
■ PCD 
■ ADSC# 
■ DP[7:0] 
■ PCHK# 
■ AP 
■ FERR# 
■ PWT 
■ APCHK# 
■ HIT# 
■ SCYC 
■ BE[7:0]# 
■ HITM# 
■ SMIACT# 
■ BREQ 
■ HLDA 
■ W/R# 
■ CACHE# 
■ LOCK# 
The VCC2DET, VCC2H/L#, and TDO signals are the only 
outputs not floated in the Tri-State Test mode. VCC2DET and 
VCC2H/L# must remain Low to ensure the system continues to 
supply the specified processor core voltage to the VCC2 pins. 
TDO is never floated because the Boundary-Scan Test Access 
Port must remain enabled at all times, including during the 
Tri-State Test mode. 

The Tri-State Test mode is exited when the processor samples 
RESET asserted. 

Boundary-Scan Test Access Port (TAP) 



The boundary-scan Test Access Port (TAP) is an IEEE standard 
that defines synchronous scanning test methods for complex 
logic circuits, such as boards containing a processor. The 
AMD-K6 3D processor supports the TAP standard defined in 
the IEEE Standard Test Access Port and Boundary-Scan 
Architecture (IEEE 1149.1-1990) specification. 

Boundary scan testing uses a shift register consisting of the 
serial interconnection of boundary-scan cells that correspond to 
each I/O buffer of the processor. This non-inverting register 
chain, called a Boundary Scan Register (BSR), can be used to 
capture the state of every processor pin and to drive every 
processor output and bidirectional pin to a known state. 

Each BSR of every component on a board that implements the 
boundary-scan architecture can be serially interconnected to 
enable component interconnect testing. 

Test Access Port 

The TAP consists of the following: 

■ Test Access Port (TAP) Controller — The TAP controller is a 
synchronous, finite state machine that uses the TMS and 
TDI input signals to control a sequence of test operations. 
See "TAP Controller State Machine" on page 278 for a list 
of TAP states and their definition. 

■ Instruction Register (IR) — The IR contains the instructions 
that select the test operation to be performed and the Test 
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Data Register (TDR) to be selected. See "TAP Registers" on 
page 273 for more details on the IR. 

■ Test Data Registers (TDR)— The three TDRs are used to 
process the test data. Each TDR is selected by an 
instruction in the Instruction Register (IR). See "TAP 
Registers" on page 273 for a list of these registers and their 
functions. 

TAP Signals 

The test signals associated with the TAP controller are as 
follows: 

■ TCK — The Test Clock for all TAP operations. The rising edge 
of TCK is used for sampling TAP signals, and the falling 
edge of TCK is used for asserting TAP signals. The state of 
the TMS signal sampled on the rising edge of TCK causes 
the state transitions of the TAP controller to occur. TCK can 
be stopped in the logic 0 or 1 state. 

■ TDI — The Test Data Input represents the input to the most 
significant bit of all TAP registers, including the IR and all 
test data registers. Test data and instructions are serially 
shifted by one bit into their respective registers on the rising 
edge of TCK. 

■ TDO — The Test Data Output represents the output of the 
least significant bit of all TAP registers, including the IR and 
all test data registers. Test data and instructions are serially 
shifted by one bit out of their respective registers on the 
falling edge of TCK. 

■ TMS — The Test Mode Select input specifies the test 
function and sequence of state changes for boundary-scan 
testing. If TMS is sampled High for five or more consecutive 
clocks, the TAP controller enters its reset state. 

■ TRST# — The Test Reset signal is an asynchronous reset that 
unconditionally causes the TAP controller to enter its reset 
state. 

Refer to Chapter 14, "Electrical Data" on page 303 and Chapter 
16, "Signal Switching Characteristics" on page 313 to obtain 
the electrical specifications of the test signals. 




TAP Registers 

The processor provides an Instruction Register (IR) and three 
Test Data Registers (TDR) to support the boundary-scan 
architecture. The IR and one of the TDRs — the Boundary-Scan 
Register (BSR)— consist of a shift register and an output 
register. The shift register is loaded in parallel in the Capture 
states. (See "TAP Controller State Machine" on page 278 for a 
description of the TAP controller states.) In addition, the shift 
register is loaded and shifted serially in the Shift states. The 
output register is loaded in parallel from its corresponding shift 
register in the Update states. 

The IR is a 5-bit register, without parity, that determines which 
instruction to run and which test data register to select. When 
the TAP controller enters the Capture-IR state, the processor 
loads the following bits into the IR shift register: 

■ 01b — Loaded into the two least significant bits, as specified 
by the IEEE 1149.1 standard 

■ 000b — Loaded into the three most significant bits 

Loading 00001b into the IR shift register during the Capture-IR 
state results in loading the SAMPLE/PRELOAD instruction. 

For each entry into the Shift-IR state, the IR shift register is 
serially shifted by one bit toward the TDO pin. During the shift, 
the most significant bit of the IR shift register is loaded from 
the TDI pin. 
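
The Capture-IR and Shift-IR behavior described above can be modeled with a few lines of C. This is a simplified software model, not a description of the silicon: the function names `ir_shift` and `ir_scan` are hypothetical, and the model assumes an instruction is shifted in least significant bit first, so that after five shifts the bits sit in their natural positions.

```c
#include <stdint.h>

#define IR_WIDTH         5u    /* the IR is a 5-bit register */
#define IR_CAPTURE_VALUE 0x01u /* 00001b, loaded in the Capture-IR state */

/* One Shift-IR step: the least significant bit moves toward TDO and
 * the most significant bit is loaded from TDI. Returns the TDO bit. */
unsigned ir_shift(uint8_t *ir, unsigned tdi)
{
    unsigned tdo = *ir & 1u;

    *ir = (uint8_t)((*ir >> 1) | ((tdi & 1u) << (IR_WIDTH - 1u)));
    return tdo;
}

/* Shift a complete 5-bit instruction in, LSB first, and collect the
 * five bits that were shifted out on TDO. */
uint8_t ir_scan(uint8_t *ir, uint8_t instruction)
{
    uint8_t out = 0;

    for (unsigned i = 0; i < IR_WIDTH; i++) {
        unsigned tdo = ir_shift(ir, (instruction >> i) & 1u);
        out = (uint8_t)(out | (tdo << i));
    }
    return out;
}
```

In this model, starting from the captured value 00001b and scanning in 00010b leaves 00010b in the shift register, while the captured 00001b pattern appears on TDO.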

The IR output register is loaded from the IR shift register in the 
Update-IR state, and the current instruction is defined by the IR 
output register. See "TAP Instructions" on page 277 for a list and 
definition of the instructions supported by the processor. 

Boundary-Scan Register (BSR) 

The BSR is a Test Data Register consisting of the 
interconnection of 152 boundary-scan cells. Each output and 
bidirectional pin of the processor requires a two-bit cell, where 
one bit corresponds to the pin and the other bit is the output 
enable for the pin. When a 0 is shifted into the enable bit of a 
cell, the corresponding pin is floated, and when a 1 is shifted 
into the enable bit, the pin is driven valid. Each input pin 
requires a one-bit cell that corresponds to the pin. The last cell 
of the BSR is reserved and does not correspond to any processor 
pin. 
The total number of bits that comprise the BSR is 281. Table 63 
on page 275 lists the order of these bits, where TDI is the input 
to bit 280, and TDO is driven from the output of bit 0. The 
entries listed as pin_E (where pin is an output or bidirectional 
signal) are the enable bits. 

If the BSR is the register selected by the current instruction 
and the TAP controller is in the Capture-DR state, the processor 
loads the BSR shift register as follows: 

■ If the current instruction is SAMPLE/PRELOAD, then the 
current state of each input, output, and bidirectional pin is 
loaded. A bidirectional pin is treated as an output if its 
enable bit equals 1, and it is treated as an input if its enable 
bit equals 0. 

■ If the current instruction is EXTEST, then the current state 
of each input pin is loaded. A bidirectional pin is treated as 
an input, regardless of the state of its enable. 

While in the Shift-DR state, the BSR shift register is serially 
shifted toward the TDO pin. During the shift, bit 280 of the BSR 
is loaded from the TDI pin. 

The BSR output register is loaded with the contents of the BSR 
shift register in the Update-DR state. If the current instruction 
is EXTEST, the processor's output pins, as well as those 
bidirectional pins that are enabled as outputs, are driven with 
their corresponding values from the BSR output register. 
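
The Shift-DR behavior of the 281-bit BSR can be modeled in C as a plain array of bits. This is a simplified sketch for illustration: it stores one bit per byte for clarity, and the type `bsr_shift_reg` and function `bsr_shift` are hypothetical names, not part of any AMD or IEEE interface.

```c
#include <stdint.h>

#define BSR_BITS 281u /* total number of BSR bits given in the text */

/* Software model of the BSR shift register, one bit per array element.
 * Index 0 is the bit that drives TDO; index 280 is loaded from TDI. */
typedef struct {
    uint8_t bit[BSR_BITS];
} bsr_shift_reg;

/* One Shift-DR step: every bit moves one position toward TDO and
 * bit 280 is loaded from TDI. Returns the bit shifted out on TDO. */
unsigned bsr_shift(bsr_shift_reg *bsr, unsigned tdi)
{
    unsigned tdo = bsr->bit[0];

    for (unsigned i = 0; i + 1u < BSR_BITS; i++)
        bsr->bit[i] = bsr->bit[i + 1u];
    bsr->bit[BSR_BITS - 1u] = (uint8_t)(tdi & 1u);
    return tdo;
}
```

A value presented on TDI therefore needs 281 shifts in total before it appears on TDO, which is the cost that the bypass path is designed to avoid.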
Table 63. Boundary Scan Bit Definitions 

Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable 
280  D35_E       247  D21         214  D4_E        181  A3          148  A20         115  A16 
279  D35         246  D18_E       213  D4          180  A31_E       147  A13_E       114  FERR_E 
278  D29_E       245  D18         212  DP0_E       179  A31         146  A13         113  FERR# 
277  D29         244  D19_E       211  DP0         178  A21_E       145  DP7_E       112  HIT_E 
276  D33_E       243  D19         210  HOLD        177  A21         144  DP7         111  HIT# 
275  D33         242  D16_E       209  BOFF#       176  A30_E       143  BE6_E       110  BE7_E 
274  D27_E       241  D16         208  AHOLD       175  A30         142  BE6#        109  BE7# 
273  D27         240  D17_E       207  STPCLK#     174  A7_E        141  A12_E       108  NA# 
272  DP3_E       239  D17         206  INIT        173  A7          140  A12         107  ADSC_E 
271  DP3         238  D15_E       205  IGNNE#      172  A24_E       139  CLK         106  ADSC# 
270  D25_E       237  D15         204  BF1         171  A24         138  BE4_E       105  BE5_E 
269  D25         236  DP1_E       203  BF2         170  A18_E       137  BE4#        104  BE5# 
268  D0_E        235  DP1         202  RESET       169  A18         136  A10_E       103  WB/WT# 
267  D0          234  D13_E       201  BF0         168  A5_E        135  A10         102  PWT_E 
266  D30_E       233  D13         200  FLUSH#      167  A5          134  D63_E       101  PWT 
265  D30         232  D6_E        199  INTR        166  A22_E       133  D63         100  BE3_E 
264  DP2_E       231  D6          198  NMI         165  A22         132  BE2_E       99   BE3# 
263  DP2         230  D14_E       197  SMI#        164  EADS#       131  BE2#        98   BREQ_E 
262  D2_E        229  D14         196  A25_E       163  A4_E        130  A15_E       97   BREQ 
261  D2          228  D11_E       195  A25         162  A4          129  A15         96   PCD_E 
260  D28_E       227  D11         194  A23_E       161  HITM_E      128  BRDY#       95   PCD 
259  D28         226  D1_E        193  A23         160  HITM#       127  BE1_E       94   WR_E 
258  D24_E       225  D1          192  A26_E       159  A9_E        126  BE1#        93   W/R# 
257  D24         224  D12_E       191  A26         158  A9          125  A14_E       92   SMIACT_E 
256  D26_E       223  D12         190  A29_E       157  SCYC_E      124  A14         91   SMIACT# 
255  D26         222  D10_E       189  A29         156  SCYC        123  BRDYC#      90   EWBE# 
254  D22_E       221  D10         188  A28_E       155  A8_E        122  BE0_E       89   DC_E 
253  D22         220  D7_E        187  A28         154  A8          121  BE0#        88   D/C# 
252  D23_E       219  D7          186  A27_E       153  A19_E       120  A17_E       87   APCHK_E 
251  D23         218  D8_E        185  A27         152  A19         119  A17         86   APCHK# 
250  D20_E       217  D8          184  A11_E       151  A6_E        118  KEN#        85   CACHE_E 
249  D20         216  D9_E        183  A11         150  A6          117  A20M#       84   CACHE# 
248  D21_E       215  D9          182  A3_E        149  A20_E       116  A16_E       83   ADS_E 
Table 63. Boundary Scan Bit Definitions (continued) 

Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable  Bit  Pin/Enable 
82   ADS#        68   DP6_E       54   D53_E       40   D43_E       26   D38_E       12   D3_E 
81   AP_E        67   DP6         53   D53         39   D43         25   D38         11   D3 
80   AP          66   D54_E       52   D47_E       38   D62_E       24   D58_E       10   D39_E 
79   INV         65   D54         51   D47         37   D62         23   D58         9    D39 
78   HLDA_E      64   D50_E       50   D59_E       36   D49_E       22   D42_E       8    D32_E 
77   HLDA        63   D50         49   D59         35   D49         21   D42         7    D32 
76   PCHK_E      62   D56_E       48   D51_E       34   DP4_E       20   D36_E       6    D5_E 
75   PCHK#       61   D56         47   D51         33   DP4         19   D36         5    D5 
74   LOCK_E      60   D55_E       46   D45_E       32   D46_E       18   D60_E       4    D37_E 
73   LOCK#       59   D55         45   D45         31   D46         17   D60         3    D37 
72   MIO_E       58   D48_E       44   D61_E       30   D41_E       16   D40_E       2    D31_E 
71   M/IO#       57   D48         43   D61         29   D41         15   D40         1    D31 
70   D52_E       56   D57_E       42   DP5_E       28   D44_E       14   D34_E       0    Reserved 
69   D52         55   D57         41   DP5         27   D44         13   D34 





Device Identification Register (DIR) 

The DIR is a 32-bit Test Data Register selected during the 
execution of the IDCODE instruction. The fields of the DIR and 
their values are shown in Table 64 and are defined as follows: 

■ Version Code— This 4-bit field is incremented by AMD 
manufacturing for each major revision of silicon. 

■ Part Number — This 16-bit field identifies the specific 
processor model. 

■ Manufacturer — This 11-bit field identifies the manufacturer 

of the component (AMD). 

■ LSB — The least significant bit (LSB) of the DIR is always set 
to 1, as specified by the IEEE 1149.1 standard. 
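
The DIR field boundaries listed above can be expressed as a small C decoder. This is a minimal sketch: the 32-bit value is assumed to have been shifted out of the processor already, and the `dir_fields` structure and `decode_dir` function are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Decoded fields of the 32-bit Device Identification Register. */
typedef struct {
    uint8_t  version;      /* bits 31-28: version code */
    uint16_t part_number;  /* bits 27-12: part number */
    uint16_t manufacturer; /* bits 11-1: manufacturer (11 bits) */
    uint8_t  lsb;          /* bit 0: always 1 per IEEE 1149.1 */
} dir_fields;

/* Split a raw IDCODE value into its four fields. */
dir_fields decode_dir(uint32_t dir)
{
    dir_fields f;

    f.version      = (uint8_t)(dir >> 28);
    f.part_number  = (uint16_t)((dir >> 12) & 0xFFFFu);
    f.manufacturer = (uint16_t)((dir >> 1) & 0x7FFu);
    f.lsb          = (uint8_t)(dir & 1u);
    return f;
}
```

Test software could use the `lsb` field as a sanity check, since a value of 0 in bit 0 would indicate that something other than a device identification register was read.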



Table 64. Device Identification Register 

Version Code    Part Number     Manufacturer    LSB 
(Bits 31-28)    (Bits 27-12)    (Bits 11-1)     (Bit 0) 
Xh              0580h           00000000001b    1b 
Bypass Register (BR) 

The BR is a Test Data Register consisting of a 1-bit shift register 
that provides the shortest path between TDI and TDO. When 
the processor is not involved in a test operation, the BR can be 
selected by an instruction to allow the transfer of test data 
through the processor without having to serially scan the test 
data through the BSR. This functionality preserves the state of 
the BSR and significantly reduces test time. 

The BR register is selected by the BYPASS and HIGHZ 
instructions as well as by any instructions not supported by the 
processor. 

TAP Instructions 

The processor supports the three instructions required by the 
IEEE 1149.1 standard— EXTEST, SAMPLE/PRELOAD, and 
BYPASS — as well as two additional optional instructions — 
IDCODE and HIGHZ. 

Table 65 shows the complete set of TAP instructions supported 
by the processor along with the 5-bit Instruction Register 
encoding and the register selected by each instruction. 



Table 65. Supported TAP Instructions 

Instruction        Encoding         Register   Description 
EXTEST (1)         00000b           BSR        Sample inputs and drive outputs 
SAMPLE/PRELOAD     00001b           BSR        Sample inputs and outputs, then load the BSR 
IDCODE             00010b           DIR        Read DIR 
HIGHZ              00011b           BR         Float outputs and bidirectional pins 
BYPASS (2)         00100b-11110b    BR         Undefined instruction, execute the BYPASS instruction 
BYPASS (3)         11111b           BR         Connect TDI to TDO to bypass the BSR 

Notes: 

1. Following the execution of the EXTEST instruction, the processor must be reset in order to return to normal, non-test operation. 
2. These instruction encodings are undefined on the AMD-K6 3D processor and default to the BYPASS instruction. 
3. Because the TDI input contains an internal pullup, the BYPASS instruction is executed if the TDI input is not connected or open 
during an instruction scan operation. The BYPASS instruction does not affect the normal operational state of the processor. 



EXTEST 

When the EXTEST instruction is executed, the processor loads 
the BSR shift register with the current state of the input and 
bidirectional pins in the Capture-DR state and drives the 
output and bidirectional pins with the corresponding values 
from the BSR output register in the Update-DR state. 
SAMPLE/PRELOAD 

The SAMPLE/PRELOAD instruction performs two functions. 
These functions are as follows: 

■ During the Capture-DR state, the processor loads the BSR 
shift register with the current state of every input, output, 
and bidirectional pin. 

■ During the Update-DR state, the BSR output register is 
loaded from the BSR shift register in preparation for the 
next EXTEST instruction. 

The SAMPLE/PRELOAD instruction does not affect the normal 
operational state of the processor. 

BYPASS 

The BYPASS instruction selects the BR register, which reduces 
the boundary-scan length through the processor from 281 to one 
(TDI to BR to TDO). The BYPASS instruction does not affect the 
normal operational state of the processor. 

IDCODE 

The IDCODE instruction selects the DIR register, allowing the 
device identification code to be shifted out of the processor. 
This instruction is loaded into the IR when the TAP controller is 
reset. The IDCODE instruction does not affect the normal 
operational state of the processor. 

HIGHZ 

The HIGHZ instruction forces all output and bidirectional pins 
to be floated. During this instruction, the BR is selected and the 
normal operational state of the processor is not affected. 
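
The mapping from IR encoding to selected register described in this section can be summarized in C. This is an illustrative sketch: the enumeration `tap_register` and the function `tap_selected_register` are hypothetical names, and the encodings are the 5-bit values listed for the supported TAP instructions.

```c
#include <stdint.h>

/* Test data registers that a TAP instruction can select. */
typedef enum {
    TAP_REG_BSR, /* Boundary-Scan Register */
    TAP_REG_DIR, /* Device Identification Register */
    TAP_REG_BR   /* Bypass Register */
} tap_register;

/* Return the register selected by a 5-bit IR encoding.
 * EXTEST (00000b) and SAMPLE/PRELOAD (00001b) select the BSR,
 * IDCODE (00010b) selects the DIR, and everything else, including
 * HIGHZ (00011b), BYPASS (11111b), and the undefined encodings,
 * selects the BR. */
tap_register tap_selected_register(uint8_t ir)
{
    switch (ir & 0x1Fu) {
    case 0x00: /* EXTEST */
    case 0x01: /* SAMPLE/PRELOAD */
        return TAP_REG_BSR;
    case 0x02: /* IDCODE */
        return TAP_REG_DIR;
    default:   /* HIGHZ, BYPASS, and undefined encodings */
        return TAP_REG_BR;
    }
}
```

The default branch reflects the statement that unsupported instructions fall back to BYPASS, so the 1-bit path between TDI and TDO is always available.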

TAP Controller State Machine 

The TAP controller state diagram is shown in Figure 82 on page 
279. State transitions occur on the rising edge of TCK. The logic 
0 or 1 next to the states represents the value of the TMS signal 
sampled by the processor on the rising edge of TCK. 
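
The state transitions can also be written as a next-state function. The sketch below follows the standard IEEE 1149.1 TAP controller state diagram, which is assumed here to match the diagram in the figure. The enumeration and the function name `tap_next_state` are hypothetical, and the model covers only the controller states, not the registers.

```c
/* The sixteen TAP controller states defined by IEEE 1149.1. */
typedef enum {
    TEST_LOGIC_RESET, RUN_TEST_IDLE,
    SELECT_DR_SCAN, CAPTURE_DR, SHIFT_DR, EXIT1_DR,
    PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SELECT_IR_SCAN, CAPTURE_IR, SHIFT_IR, EXIT1_IR,
    PAUSE_IR, EXIT2_IR, UPDATE_IR
} tap_state;

/* Next state for the value of TMS sampled on a rising edge of TCK. */
tap_state tap_next_state(tap_state s, int tms)
{
    switch (s) {
    case TEST_LOGIC_RESET: return tms ? TEST_LOGIC_RESET : RUN_TEST_IDLE;
    case RUN_TEST_IDLE:    return tms ? SELECT_DR_SCAN   : RUN_TEST_IDLE;
    case SELECT_DR_SCAN:   return tms ? SELECT_IR_SCAN   : CAPTURE_DR;
    case CAPTURE_DR:       return tms ? EXIT1_DR         : SHIFT_DR;
    case SHIFT_DR:         return tms ? EXIT1_DR         : SHIFT_DR;
    case EXIT1_DR:         return tms ? UPDATE_DR        : PAUSE_DR;
    case PAUSE_DR:         return tms ? EXIT2_DR         : PAUSE_DR;
    case EXIT2_DR:         return tms ? UPDATE_DR        : SHIFT_DR;
    case UPDATE_DR:        return tms ? SELECT_DR_SCAN   : RUN_TEST_IDLE;
    case SELECT_IR_SCAN:   return tms ? TEST_LOGIC_RESET : CAPTURE_IR;
    case CAPTURE_IR:       return tms ? EXIT1_IR         : SHIFT_IR;
    case SHIFT_IR:         return tms ? EXIT1_IR         : SHIFT_IR;
    case EXIT1_IR:         return tms ? UPDATE_IR        : PAUSE_IR;
    case PAUSE_IR:         return tms ? EXIT2_IR         : PAUSE_IR;
    case EXIT2_IR:         return tms ? UPDATE_IR        : SHIFT_IR;
    case UPDATE_IR:        return tms ? SELECT_DR_SCAN   : RUN_TEST_IDLE;
    }
    return TEST_LOGIC_RESET; /* not reached for valid states */
}
```

Stepping this function with TMS held High shows why five consecutive High samples always reach the reset state: no state is more than five such transitions away from Test-Logic-Reset.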
IEEE Std 1149.1-1990, Copyright © 1990. IEEE. All rights reserved 

Figure 82. TAP State Diagram 
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The states of the TAP controller are described as follows: 

Test-Logic-Reset 

This state represents the initial reset state of the TAP controller 
and is entered when the processor samples RESET asserted, 
when TRST# is asynchronously asserted, and when TMS is 
sampled High for five or more consecutive clocks. In addition, 
this state can be entered from the Select-IR-Scan state. The IR 
is initialized with the IDCODE instruction, and the processor's 
normal operation is not affected in this state. 

Capture-DR 

During the SAMPLE/PRELOAD instruction, the processor 
loads the BSR shift register with the current state of every 
input, output, and bidirectional pin. During the EXTEST 
instruction, the processor loads the BSR shift register with the 
current state of every input and bidirectional pin. 

Capture-IR 

When the TAP controller enters the Capture-IR state, the 
processor loads 01b into the two least significant bits of the IR 
shift register and loads 000b into the three most significant bits 
of the IR shift register. 

Shift-DR 

While in the Shift-DR state, the selected TDR shift register is 
serially shifted toward the TDO pin. During the shift, the most 
significant bit of the TDR is loaded from the TDI pin. 

Shift-IR 

While in the Shift-IR state, the IR shift register is serially 
shifted toward the TDO pin. During the shift, the most 
significant bit of the IR is loaded from the TDI pin. 

Update-DR 

During the SAMPLE/PRELOAD instruction, the BSR output 
register is loaded with the contents of the BSR shift register. 
During the EXTEST instruction, the output pins, as well as 
those bidirectional pins defined as outputs, are driven with 
their corresponding values from the BSR output register. 

Update-IR 

In this state, the IR output register is loaded from the IR shift 
register, and the current instruction is defined by the IR output 
register. 
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The following states have no effect on the normal or test 
operation of the processor other than as shown in Figure 82 on 
page 279: 

■ Run-Test/Idle — This state is an idle state between scan 
operations. 

■ Select-DR-Scan — This is the initial state of the test data 
register state transitions. 

■ Select-IR-Scan — This is the initial state of the Instruction 
Register state transitions. 

■ Exit1-DR — This state is entered to terminate the shifting 
process and enter the Update-DR state. 

■ Exit1-IR — This state is entered to terminate the shifting 
process and enter the Update-IR state. 

■ Pause-DR — This state is entered to temporarily stop the 

shifting process of a Test Data Register. 

■ Pause-IR — This state is entered to temporarily stop the 
shifting process of the Instruction Register. 

■ Exit2-DR — This state is entered in order to either terminate 
the shifting process and enter the Update-DR state or to 
resume shifting following the exit from the Pause-DR state. 

■ Exit2-IR — This state is entered in order to either terminate 
the shifting process and enter the Update-IR state or to 
resume shifting following the exit from the Pause-IR state. 
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Purpose 

The AMD-K6 3D processor provides a means for inhibiting the 
normal operation of its L1 instruction and data caches while 
still supporting an external Level-2 (L2) cache. This capability 
allows system designers to disable the L1 cache during the 
testing and debug of an L2 cache. 

If the Cache Inhibit bit (bit 3) of Test Register 12 (TR12) is set 
to 0, the processor's L1 cache is enabled and operates as 
described in Chapter 8, "Cache Organization" on page 233. If 
the Cache Inhibit bit is set to 1, the L1 cache is disabled and no 
new cache lines are allocated. Even though new allocations do 
not occur, valid L1 cache lines remain valid and are read by the 
processor when a requested address hits a cache line. In 
addition, the processor continues to support inquire cycles 
initiated by the system logic, including the execution of 
writeback cycles when a modified cache line is hit. 

While the L1 is inhibited, the processor continues to drive the 
PCD output signal appropriately, which system logic can use to 
control external L2 caching. 

In order to completely disable the L1 cache so no valid lines 
exist in the cache, the Cache Inhibit bit must be set to 1 and the 
cache must be flushed in one of the following ways: 

■ By asserting the FLUSH# input signal 

■ By executing the WBINVD instruction 

■ By executing the INVD instruction (modified cache lines are 
not written back to memory) 
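
The Cache Inhibit bit manipulation can be sketched as plain bit operations on a TR12 value. This is only an illustration of the bit position given above: how TR12 is actually read and written is outside the scope of this sketch, and the helper names `tr12_with_cache_inhibit` and `tr12_l1_enabled` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TR12_CACHE_INHIBIT (1u << 3) /* bit 3 of TR12 */

/* Return a copy of the TR12 value with the Cache Inhibit bit set
 * (L1 inhibited) or cleared (L1 enabled). Other bits are preserved. */
uint32_t tr12_with_cache_inhibit(uint32_t tr12, bool inhibit)
{
    if (inhibit)
        return tr12 | TR12_CACHE_INHIBIT;
    return tr12 & ~TR12_CACHE_INHIBIT;
}

/* The L1 cache operates normally only when the bit is 0. */
bool tr12_l1_enabled(uint32_t tr12)
{
    return (tr12 & TR12_CACHE_INHIBIT) == 0;
}
```

Setting the bit alone only stops new allocations. As the text notes, a flush is still required afterwards if no valid lines may remain in the L1 cache.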
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Debug 



The processor implements the standard x86 debug functions, 
registers, and exceptions. In addition, the processor supports 
the I/O breakpoint debug extension. The debug feature assists 
programmers and system designers during software execution 
tracing by generating exceptions when one or more events 
occur during processor execution. The exception handler, or 
debugger, can be written to perform various tasks, such as 
displaying the conditions that caused the breakpoint to occur, 
displaying and modifying register or memory contents, or 
single-stepping through program execution. 

The following sections describe the debug registers and the 
various types of breakpoints and exceptions that the processor 
supports. 

Debug Registers 

Starting on page 284, Figures 83 through 86 show the 32-bit 
debug registers supported by the processor. 
Symbol   Description                           Bits 
LEN 3    Length of Breakpoint #3               31-30 
R/W 3    Type of Transaction(s) to Trap        29-28 
LEN 2    Length of Breakpoint #2               27-26 
R/W 2    Type of Transaction(s) to Trap        25-24 
LEN 1    Length of Breakpoint #1               23-22 
R/W 1    Type of Transaction(s) to Trap        21-20 
LEN 0    Length of Breakpoint #0               19-18 
R/W 0    Type of Transaction(s) to Trap        17-16 
GD       General Detect Enabled                13 
GE       Global Exact Breakpoint Enabled       9 
LE       Local Exact Breakpoint Enabled        8 
G3       Global Exact Breakpoint #3 Enabled    7 
L3       Local Exact Breakpoint #3 Enabled     6 
G2       Global Exact Breakpoint #2 Enabled    5 
L2       Local Exact Breakpoint #2 Enabled     4 
G1       Global Exact Breakpoint #1 Enabled    3 
L1       Local Exact Breakpoint #1 Enabled     2 
G0       Global Exact Breakpoint #0 Enabled    1 
L0       Local Exact Breakpoint #0 Enabled     0 

Figure 83. Debug Register DR7 
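
The DR7 field positions in Figure 83 follow a regular pattern that can be captured in a short C helper. This is a sketch under stated assumptions: the function name `dr7_set_breakpoint` is hypothetical, and the 2-bit R/W and LEN values are passed through unchanged because their encodings are not reproduced in the figure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Program the DR7 fields for breakpoint n (0-3) and return the new
 * register value. Per Figure 83, Ln is bit 2n, Gn is bit 2n+1,
 * R/Wn occupies bits 16+4n and 17+4n, and LENn occupies bits
 * 18+4n and 19+4n. */
uint32_t dr7_set_breakpoint(uint32_t dr7, unsigned n, unsigned rw,
                            unsigned len, bool local, bool global)
{
    n &= 3u;

    /* Clear the enable bits and the R/W and LEN fields for slot n. */
    dr7 &= ~((0x3u << (2u * n)) | (0xFu << (16u + 4u * n)));

    if (local)
        dr7 |= 1u << (2u * n);
    if (global)
        dr7 |= 1u << (2u * n + 1u);
    dr7 |= (uint32_t)(rw & 3u) << (16u + 4u * n);
    dr7 |= (uint32_t)(len & 3u) << (18u + 4u * n);
    return dr7;
}
```

Clearing the old fields before setting the new ones lets the helper reprogram one breakpoint without disturbing the other three or the GD, GE, and LE bits.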
Symbol   Description                          Bit 
BT       Breakpoint Task Switch               15 
BS       Breakpoint Single Step               14 
BD       Breakpoint Debug Access Detected     13 
B3       Breakpoint #3 Condition Detected     3 
B2       Breakpoint #2 Condition Detected     2 
B1       Breakpoint #1 Condition Detected     1 
B0       Breakpoint #0 Condition Detected     0 

Figure 84. Debug Register DR6 



Figure 85. Debug Registers DR5 and DR4 (each register spans bits 31-0)
Register   Bits   Contents
DR3        31-0   Breakpoint 3 32-bit Linear Address
DR2        31-0   Breakpoint 2 32-bit Linear Address
DR1        31-0   Breakpoint 1 32-bit Linear Address
DR0        31-0   Breakpoint 0 32-bit Linear Address

Figure 86. Debug Registers DR3, DR2, DR1, and DR0



DR3-DR0

The processor allows the setting of up to four breakpoints.
DR3-DR0 contain the linear addresses for breakpoint 3 through
breakpoint 0, respectively, and are compared to the linear
addresses of processor cycles to determine if a breakpoint
occurs. Debug register DR7 defines the specific type of cycle
that must occur in order for the breakpoint to occur.

DR5-DR4

When debugging extensions are disabled (bit 3 of CR4 is set to
0), the DR5 and DR4 registers are mapped to DR7 and DR6,
respectively, in order to be software compatible with previous
generations of x86 processors. When debugging extensions are
enabled (bit 3 of CR4 is set to 1), any attempt to load DR5 or
DR4 results in an undefined opcode exception. Likewise, any
attempt to store DR5 or DR4 also results in an undefined
opcode exception.

DR6

If a breakpoint is enabled in DR7, and the breakpoint
conditions as defined in DR7 occur, then the corresponding
B-bit (B3-B0) in DR6 is set to 1. In addition, any other 
breakpoints defined using these particular breakpoint 
conditions are reported by the processor by setting the 
appropriate B-bits in DR6, regardless of whether these 
breakpoints are enabled or disabled. However, if a breakpoint is 
not enabled, a debug exception does not occur for that 
breakpoint. 

If the processor decodes an instruction that writes or reads DR7
through DR0, the BD bit (bit 13) in DR6 is set to 1 (if enabled in
DR7) and the processor generates a debug exception. This 
operation allows control to pass to the debugger prior to debug 
register access by software. 

If the Trap Flag (bit 8) of the EFLAGS register is set to 1, the 
processor generates a debug exception after the successful 
execution of every instruction (single-step operation) and sets 
the BS bit (bit 14) in DR6 to indicate the source of the 
exception. 

When the processor switches to a new task and the debug trap 
bit (T-bit) in the corresponding Task State Segment (TSS) is set 
to 1, the processor sets the BT bit (bit 15) in DR6 and generates 
a debug exception. 
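
The DR6 status bits described above can be interpreted with a small helper. The following is a minimal sketch in Python, not processor or debugger code: it assumes the DR6 value has already been read by privileged software, and the names DR6_FLAGS and decode_dr6 are hypothetical. The bit positions are those given in Figure 84 and in the text.

```python
# Bit positions of the DR6 status flags (Figure 84 and the DR6 description).
DR6_FLAGS = {
    "B0": 0,   # breakpoint 0 condition detected
    "B1": 1,   # breakpoint 1 condition detected
    "B2": 2,   # breakpoint 2 condition detected
    "B3": 3,   # breakpoint 3 condition detected
    "BD": 13,  # debug register access detected
    "BS": 14,  # single step
    "BT": 15,  # task switch
}


def decode_dr6(value):
    """Return the names of the DR6 status flags that are set in value."""
    return [name for name, bit in DR6_FLAGS.items() if (value >> bit) & 1]


# Example: a single-step trap reported together with breakpoint 1.
print(decode_dr6((1 << 14) | (1 << 1)))
```

A debugger would typically use such a decoding step at the top of its Interrupt 01h handler to decide which condition it is servicing.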

DR7

When set to 1, L3-L0 locally enable breakpoints 3 through 0,
respectively. L3-L0 are set to 0 whenever the processor
executes a task switch. Setting L3-L0 to 0 disables the 
breakpoints and ensures that these particular debug exceptions 
are only generated for a specific task. 

When set to 1, G3-G0 globally enable breakpoints 3 through 0, 
respectively. Unlike L3-L0, G3-G0 are not set to 0 whenever the 
processor executes a task switch. Not setting G3-G0 to 0 allows 
breakpoints to remain enabled across all tasks. If a breakpoint 
is enabled globally but disabled locally, the global enable 
overrides the local enable. 
The LE (bit 8) and GE (bit 9) bits in DR7 have no effect on the
operation of the processor and are provided in order to be 
software compatible with previous generations of x86 
processors. 

When set to 1, the GD bit in DR7 (bit 13) enables the debug 
exception associated with the BD bit (bit 13) in DR6. This bit is 
set to 0 when a debug exception is generated. 

LEN3-LEN0 and RW3-RW0 are two-bit fields in DR7 that 
specify the length and type of each breakpoint as defined in 
Table 66. 



Table 66. DR7 LEN and RW Definitions

LEN Bits (1)   RW Bits    Breakpoint
00b            00b (2)    Instruction Execution
00b            01b        One-byte Data Write
01b            01b        Two-byte Data Write
11b            01b        Four-byte Data Write
00b            10b (3)    One-byte I/O Read or Write
01b            10b        Two-byte I/O Read or Write
11b            10b        Four-byte I/O Read or Write
00b            11b        One-byte Data Read or Write
01b            11b        Two-byte Data Read or Write
11b            11b        Four-byte Data Read or Write

Notes:

1. LEN bits equal to 10b is undefined.

2. When RW equals 00b, LEN must be equal to 00b.

3. When RW equals 10b, debugging extensions (DE) must be enabled (bit 3 of CR4 must be set
to 1). If DE is set to 0, then RW equal to 10b is undefined.
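
The DR7 layout in Figure 83 and the encodings in Table 66 can be combined into a single value. The sketch below is illustrative only: the function name dr7_fields and its parameters are hypothetical, and it only computes the bits for one breakpoint. Actually loading DR7 requires a privileged MOV to the debug register, which is outside this example.

```python
# Encodings taken from Table 66.
LEN_ENCODING = {1: 0b00, 2: 0b01, 4: 0b11}   # length in bytes -> LEN bits
RW_ENCODING = {
    "execute": 0b00,      # instruction execution
    "write": 0b01,        # data write
    "io": 0b10,           # I/O read or write (needs CR4.DE = 1)
    "read_write": 0b11,   # data read or write
}


def dr7_fields(index, kind, length=1, local=True, global_=False,
               de_enabled=False):
    """Return the DR7 bits that configure breakpoint `index` (0 through 3)."""
    if index not in range(4):
        raise ValueError("breakpoint index must be 0 through 3")
    if kind not in RW_ENCODING or length not in LEN_ENCODING:
        raise ValueError("unsupported breakpoint type or length")
    if kind == "execute" and length != 1:
        raise ValueError("RW = 00b requires LEN = 00b (Table 66, note 2)")
    if kind == "io" and not de_enabled:
        raise ValueError("RW = 10b requires CR4.DE = 1 (Table 66, note 3)")

    value = RW_ENCODING[kind] << (16 + 4 * index)      # R/W n field
    value |= LEN_ENCODING[length] << (18 + 4 * index)  # LEN n field
    if local:
        value |= 1 << (2 * index)                      # Ln enable bit
    if global_:
        value |= 1 << (2 * index + 1)                  # Gn enable bit
    return value


# Example: breakpoint 1 traps four-byte data writes, locally enabled.
print(hex(dr7_fields(1, "write", length=4)))
```

Values for several breakpoints can be OR-ed together, since each breakpoint owns a separate group of bits.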



Debug Exceptions 

A debug exception is categorized as either a debug trap or a 
debug fault. A debug trap calls the debugger following the 
execution of the instruction that caused the trap. A debug fault 
calls the debugger prior to the execution of the instruction that 
caused the fault. All debug traps and faults generate either an
Interrupt 01h or an Interrupt 03h exception.
Interrupt 01h

The following events are considered debug traps that cause the
processor to generate an Interrupt 01h exception:

■ Enabled breakpoints for data and I/O cycles 

■ Single Step Trap 

■ Task Switch Trap 

The following events are considered debug faults that cause the 
processor to generate an Interrupt 01h exception:

■ Enabled breakpoints for instruction execution 

■ BD bit in DR6 set to 1 

Interrupt 03h

The INT 3 instruction is defined in the x86 architecture as a
breakpoint instruction. This instruction causes the processor to
generate an Interrupt 03h exception. This exception is a debug 
trap because the debugger is called following the execution of 
the INT 3 instruction. 

The INT 3 instruction is a one-byte instruction (opcode CCh) 
typically used to insert a breakpoint in software by writing CCh 
to the address of the first byte of the instruction to be trapped 
(the target instruction). Following the trap, if the target 
instruction is to be executed, the debugger must replace the 
INT 3 instruction with the first byte of the target instruction. 
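
The insert-and-restore procedure described above can be modeled on a byte buffer. This is a simulation sketch only: a bytearray stands in for code memory, and the class SoftwareBreakpoints is a hypothetical helper, not part of any real debugger.

```python
INT3_OPCODE = 0xCC  # the one-byte INT 3 instruction


class SoftwareBreakpoints:
    """Track INT 3 breakpoints patched into a mutable code buffer."""

    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for code memory
        self.saved = {}        # address -> original first byte

    def insert(self, address):
        """Overwrite the first byte of the target instruction with CCh."""
        if address not in self.saved:
            self.saved[address] = self.memory[address]
            self.memory[address] = INT3_OPCODE

    def remove(self, address):
        """Restore the original first byte so the target can execute."""
        self.memory[address] = self.saved.pop(address)


# Example buffer: nop; mov eax, ebx; ret
code = bytearray([0x90, 0x89, 0xD8, 0xC3])
bps = SoftwareBreakpoints(code)
bps.insert(1)    # breakpoint on the mov instruction
print(code.hex())
bps.remove(1)    # restore before executing the target instruction
print(code.hex())
```

Saving the original byte at insertion time is what allows the debugger to put the target instruction back after the trap.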
Clock Control



The AMD-K6 3D processor supports five modes of clock 
control. The processor can transition between these modes to 
maximize performance, to minimize power dissipation, or to 
provide a balance between performance and power. (See
"Power Dissipation" on page 307 for the maximum power
dissipation of the processor within the normal and 
reduced-power states.) 

The five clock-control states supported are as follows: 

■ Normal State: The processor is running in Real Mode,
Virtual-8086 Mode, Protected Mode, or System Management
Mode (SMM). In this state, all clocks are running (including
the external bus clock CLK and the internal processor
clock) and the full features and functions of the processor
are available.

■ Halt State: This low-power state is entered following the 
successful execution of the HLT instruction. During this
state, the internal processor clock is stopped. 

■ Stop Grant State: This low-power state is entered following 
the recognition of the assertion of the STPCLK# signal. 
During this state, the internal processor clock is stopped. 

■ Stop Grant Inquire State: This state is entered from the Halt 
state and the Stop Grant state as the result of a 
system-initiated inquire cycle. 

■ Stop Clock State: This low-power state is entered from the 
Stop Grant state when the CLK signal is stopped. 
The following sections describe each of the four low-power
states. Figure 87 on page 297 illustrates the clock control state 
transitions. 



Halt State

Enter Halt State

During the execution of the HLT instruction, the processor
executes a Halt special cycle. After BRDY# is sampled asserted
during this cycle, and then EWBE# is also sampled asserted, the
processor enters the Halt state in which the processor disables
most of its internal clock distribution. In order to support the
following operations, the internal phase-lock loop (PLL) still
runs, and some internal resources are still clocked in the Halt
state:

■ Inquire Cycles: The processor continues to sample AHOLD,
BOFF#, and HOLD in order to support inquire cycles that
are initiated by the system logic. The processor transitions to
the Stop Grant Inquire state during the inquire cycle. After
returning to the Halt state following the inquire cycle, the
processor does not execute another Halt special cycle.

■ Flush Cycles: The processor continues to sample FLUSH#. If
FLUSH# is sampled asserted, the processor performs the
flush operation in the same manner as it is performed in the
Normal state. Upon completing the flush operation, the
processor executes the Halt special cycle which indicates
the processor is in the Halt state.

■ Time Stamp Counter (TSC): The TSC continues to count in
the Halt state.

■ Signal Sampling: The processor continues to sample INIT,
INTR, NMI, RESET, and SMI#.

After entering the Halt state, all signals driven by the processor
retain their state as they existed following the completion of
the Halt special cycle.

Exit Halt State

The processor remains in the Halt state until it samples INIT, 
INTR (if interrupts are enabled), NMI, RESET, or SMI# 
asserted. If any of these signals is sampled asserted, the 
processor returns to the Normal state and performs the 
corresponding operation. All of the normal requirements for 
recognition of these input signals apply within the Halt state. 

Stop Grant State 



Enter Stop Grant State 

After recognizing the assertion of STPCLK#, the processor 
flushes its instruction pipelines, completes all pending and 
in-progress bus cycles, and acknowledges the STPCLK# 
assertion by executing a Stop Grant special bus cycle. After 
BRDY# is sampled asserted during this cycle, and then EWBE# 
is also sampled asserted, the processor enters the Stop Grant 
state. The Stop Grant state is like the Halt state in that the 
processor disables most of its internal clock distribution in the 
Stop Grant state. In order to support the following operations, 
the internal PLL still runs, and some internal resources are still 
clocked in the Stop Grant state: 

■ Inquire cycles: The processor transitions to the Stop Grant 
Inquire state during an inquire cycle. After returning to the 
Stop Grant state following the inquire cycle, the processor 
does not execute another Stop Grant special cycle. 

■ Time Stamp Counter (TSC): The TSC continues to count in 
the Stop Grant state. 

■ Signal Sampling: The processor continues to sample INIT,
INTR, NMI, RESET, and SMI#. 

FLUSH# is not recognized in the Stop Grant state (unlike while
in the Halt state). 

Upon entering the Stop Grant state, all signals driven by the 
processor retain their state as they existed following the 
completion of the Stop Grant special cycle. 
Exit Stop Grant State

The processor remains in the Stop Grant state until it samples 
STPCLK# negated or RESET asserted. If STPCLK# is sampled 
negated, the processor returns to the Normal state in less than
10 bus clock (CLK) periods. After the transition to the Normal 
state, the processor resumes execution at the instruction 
boundary on which STPCLK# was initially recognized. 

If STPCLK# is recognized as negated in the Stop Grant state 
and subsequently sampled asserted prior to returning to the 
Normal state, the processor guarantees that a minimum of one 
instruction is executed prior to re-entering the Stop Grant 
state. 

If INIT, INTR (if interrupts are enabled), FLUSH#, NMI, or 
SMI# are sampled asserted in the Stop Grant state, the 
processor latches the edge-sensitive signals (INIT, FLUSH#, 
NMI, and SMI#), but otherwise does not exit the Stop Grant 
state to service the interrupt. When the processor returns to the 
Normal state due to sampling STPCLK# negated, any pending 
interrupts are recognized after returning to the Normal state. 
To ensure their recognition, all of the normal requirements for 
these input signals apply within the Stop Grant state. 

If RESET is sampled asserted in the Stop Grant state, the 
processor immediately returns to the Normal state and the 
reset process begins. 
Stop Grant Inquire State



Enter Stop Grant Inquire State

The Stop Grant Inquire state is entered from the Stop Grant
state or the Halt state when EADS# is sampled asserted during
an inquire cycle initiated by the system logic. The processor 
responds to an inquire cycle in the same manner as in the 
Normal state by driving HIT# and HITM#. If the inquire cycle 
hits a modified data cache line, the processor performs a 
writeback cycle. 

Exit Stop Grant Inquire State 

Following the completion of any writeback, the processor 
returns to the state from which it entered the Stop Grant 
Inquire state. 

Stop Clock State 



Enter Stop Clock State 

If the CLK signal is stopped while the processor is in the Stop 
Grant state, the processor enters the Stop Clock state. Because 
all internal clocks and the PLL are not running in the Stop 
Clock state, the Stop Clock state represents the 
minimum-power state of all clock control states. The CLK signal 
must be held Low while it is stopped. 

The Stop Clock state cannot be entered from the Halt state. 

INTR is the only input signal that is allowed to change states 
while the processor is in the Stop Clock state. However, INTR is 
not sampled until the processor returns to the Stop Grant state. 
All other input signals must remain unchanged in the Stop 
Clock state. 
Exit Stop Clock State

The processor returns to the Stop Grant state from the Stop 
Clock state after the CLK signal is started and the internal PLL 
has stabilized. PLL stabilization is achieved after the CLK 
signal has been running within its specification for a minimum 
of 1.0 ms. 

The frequency of CLK when exiting the Stop Clock state can be 
different than the frequency of CLK when entering the Stop 
Clock state. 

The state of the BF[2:0] signals when exiting the Stop Clock 
state is ignored because the BF[2:0] signals are only sampled 
during the falling transition of RESET. 
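
The state transitions described in the preceding sections can be summarized as a small state machine. The sketch below is a simplified model, not a description of the processor's internal logic: it encodes only the transitions stated in the text, the event names and the ClockControl class are hypothetical, and conditions such as "INTR only if interrupts are enabled" are left to the caller.

```python
NORMAL, HALT, STOP_GRANT, INQUIRE, STOP_CLOCK = (
    "Normal", "Halt", "Stop Grant", "Stop Grant Inquire", "Stop Clock")

# (current state, event) -> next state, as stated in the text.
TRANSITIONS = {
    (NORMAL, "HLT executed"): HALT,
    (NORMAL, "STPCLK# asserted"): STOP_GRANT,
    (STOP_GRANT, "STPCLK# negated"): NORMAL,
    (STOP_GRANT, "RESET asserted"): NORMAL,
    (STOP_GRANT, "CLK stopped"): STOP_CLOCK,
    (STOP_CLOCK, "CLK restarted"): STOP_GRANT,  # after the PLL stabilizes
}
for signal in ("INIT", "INTR", "NMI", "RESET", "SMI#"):
    TRANSITIONS[(HALT, signal + " asserted")] = NORMAL


class ClockControl:
    """Track the clock-control state; unknown events leave it unchanged."""

    def __init__(self):
        self.state = NORMAL
        self._return_to = None

    def handle(self, event):
        if event == "inquire cycle" and self.state in (HALT, STOP_GRANT):
            # Remember the origin so the processor can return to it.
            self._return_to = self.state
            self.state = INQUIRE
        elif event == "inquire complete" and self.state == INQUIRE:
            self.state = self._return_to
        else:
            self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state


cc = ClockControl()
for ev in ("STPCLK# asserted", "CLK stopped", "CLK restarted",
           "STPCLK# negated"):
    print(ev, "->", cc.handle(ev))
```

Note that the table contains no entry from the Halt state to the Stop Clock state, which mirrors the restriction that the Stop Clock state cannot be entered from the Halt state.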
Figure 87. Clock Control State Transitions (state diagram; recoverable
labels: HLT Instruction; RESET, SMI#, INIT, or INTR Asserted; Normal Mode
(Real, Virtual-8086, Protected, SMM); STPCLK# Asserted; STPCLK# Negated,
or RESET Asserted)
13

Power and Grounding 



Power Connections 



The AMD-K6 3D processor is a dual voltage device. Two
separate supply voltages are required: VCC2 and VCC3. VCC2
provides the core voltage for the processor and VCC3 provides
the I/O voltage. See Chapter 14, "Electrical Data" on page 303
for the value and range of VCC2 and VCC3.

There are 28 VCC2, 32 VCC3, and 68 VSS pins on the processor.
(See "Pin Designations" on page 342 for all power and ground
pin designations.) The large number of power and ground pins
is provided to ensure that the processor and package maintain
a clean and stable power distribution network.

For proper operation and functionality, all VCC2, VCC3, and VSS
pins must be connected to the appropriate planes in the circuit 
board. The power planes have been arranged in a pattern to 
simplify routing and minimize crosstalk on the circuit board. 
The isolation region between two voltage planes must be at 
least 0.254mm if they are in the same layer of the circuit board. 
(See Figure 88 on page 300.) In order to maintain a 
low-impedance current sink and reference, the ground plane 
must never be split. 
Although the processor has two separate supply voltages, there
are no special power sequencing requirements. The best
procedure is to minimize the time between which VCC2 and
VCC3 are either both on or both off.



Figure 88. Suggested Component Placement (isolation region between
voltage planes: 0.254 mm minimum)
Decoupling Recommendations



In addition to the isolation region mentioned in "Power 
Connections" on page 299, adequate decoupling capacitance is 
required between the two system power planes and the ground 
plane to minimize ringing and to provide a low-impedance path 
for return currents. Suggested decoupling capacitor placement 
is shown in Figure 88 on page 300. 

Surface mounted capacitors should be used under the 
processor's ZIF socket to minimize resistance and inductance in 
the lead lengths while maintaining minimal height. For 
information and recommendations about the specific value, 
quantity, and location of capacitors, see the AMD website at 
www.amd.com/K6/k6docs/.

Pin Connection Requirements 

For proper operation, the following requirements for signal pin 
connections must be met: 

■ Do not drive address and data signals into large capacitive 
loads at high frequencies. If necessary, use buffer chips to 
drive large capacitive loads. 

■ Leave all NC (no-connect) pins unconnected. 

■ Unused inputs should always be connected to an 
appropriate signal level. 

• Active Low inputs that are not being used should be 
connected to VCC3 through a 20-kohm pullup resistor.

• Active High inputs that are not being used should be 
connected to GND through a pulldown resistor. 

■ Reserved signals can be treated in one of the following ways: 

• As no-connect (NC) pins, in which case these pins are left 
unconnected 

• As pins connected to the system logic as defined by the 
industry-standard Pentium interface (Socket 7) 

• Any combination of NC and Socket 7 pins 

■ Keep trace lengths to a minimum. 
14

Electrical Data 



Introduction 



This chapter contains electrical data that is subject to change. 
For the latest values, see the AMD website at 
www.amd.com/K6/k6docs/. 

Operating Ranges 



The functional operation of the AMD-K6 3D processor is 
guaranteed if the voltage and temperature parameters are 
within the limits defined in Table 67. 

Table 67. Operating Ranges

Parameter   Minimum   Typical   Maximum   Comments
VCC2        2.1 V     2.2 V     2.3 V     Note
VCC3        3.135 V   3.30 V    3.6 V     Note
TCASE       0°C                 70°C

Note:

VCC2 and VCC3 are referenced from
Absolute Ratings



Functional operation of the processor is not guaranteed beyond 
the operating ranges listed in Table 67 on page 303. Exposure to 
conditions outside these operating ranges for extended periods 
of time can affect long-term reliability. Permanent damage can 
occur if the absolute ratings listed in Table 68 are exceeded. 



Table 68. Absolute Ratings

Parameter            Minimum   Maximum                    Comments
VCC2                 -0.5 V    2.5 V
VCC3                 -0.5 V    3.6 V
VPIN                 -0.5 V    VCC3 + 0.5 V and <4.0 V    Note
TCASE (under bias)   -65°C     +110°C
TSTORAGE                       +150°C

Note:

VPIN (the voltage on any I/O pin) must not be greater than 0.5 V above the voltage being applied
to VCC3. In addition, the VPIN voltage must never exceed 4.0 V.

DC Characteristics 



The DC characteristics of the processor are shown in Table 69. 

Table 69. DC Characteristics

                                                    Preliminary Data
Symbol   Parameter Description                      Min      Max            Comments
VIL      Input Low Voltage                          -0.3 V   +0.8 V
VIH      Input High Voltage                         2.0 V    VCC3 + 0.3 V   Note 1
VOL      Output Low Voltage                                  0.4 V          IOL = 4.0-mA load
VOH      Output High Voltage                        2.4 V                   IOH = 3.0-mA load
ICC2     2.2 V Power Supply Current                          6.50 A         233 MHz, Note 2,7
                                                             6.90 A         250 MHz, Note 2,8
                                                             7.35 A         266 MHz, Note 2,7
                                                             8.45 A         300 MHz, Note 2,9
                                                             9.40 A         333 MHz, Note 2,7
ICC3     3.3 V Power Supply Current                          0.52 A         233 MHz, Note 3,7
                                                             0.53 A         250 MHz, Note 3,8
                                                             0.54 A         266 MHz, Note 3,7
                                                             0.56 A         300 MHz, Note 3,9
                                                             0.58 A         333 MHz, Note 3,7
ILI      Input Leakage Current                               ±15 µA         Note 4
ILO      Output Leakage Current                              ±15 µA         Note 4
IIL      Input Leakage Current Bias with Pullup              -400 µA        Note 5
IIH      Input Leakage Current Bias with Pulldown            200 µA         Note 6
CIN      Input Capacitance                                   10 pF
COUT     Output Capacitance                                  15 pF

Notes:

1. VCC3 refers to the voltage being applied to VCC3 during functional operation.

2. VCC2 = 2.3 V. The maximum power supply current must be taken into account when designing a power supply.

3. VCC3 = 3.6 V. The maximum power supply current must be taken into account when designing a power supply.

4. Refers to inputs and I/O without an internal pullup resistor and 0 ≤ VIN ≤ VCC3.

5. Refers to inputs with an internal pullup and VIL = 0.4 V.

6. Refers to inputs with an internal pulldown and VIH = 2.4 V.

7. CLK frequency equals 66 MHz.

8. CLK frequency equals 100 MHz.

9. This specification applies to components using a CLK frequency of 66 MHz or 100 MHz.
Table 69. DC Characteristics (continued)

                                                    Preliminary Data
Symbol   Parameter Description                      Min      Max      Comments
         I/O Capacitance                                     20 pF
         CLK Capacitance                                     10 pF
         Test Input Capacitance (TDI, TMS, TRST#)            10 pF
CTOUT    Test Output Capacitance (TDO)                       15 pF
         TCK Capacitance                                     10 pF
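
Notes 2 and 3 of Table 69 ask the designer to account for the maximum supply current. One way to do that is to multiply each maximum current by the supply voltage at which it is specified. The sketch below is a rough sizing aid under stated assumptions: the currents are the Table 69 maximums, the voltages are the note 2 and note 3 conditions read as 2.3 V and 3.6 V, and the function name supply_budget_w is hypothetical. It estimates electrical supply load, not the thermal power listed in Table 70.

```python
# Maximum supply currents from Table 69, keyed by core frequency in MHz.
ICC2_MAX_A = {233: 6.50, 250: 6.90, 266: 7.35, 300: 8.45, 333: 9.40}
ICC3_MAX_A = {233: 0.52, 250: 0.53, 266: 0.54, 300: 0.56, 333: 0.58}

# Supply voltages at which those maximums are specified (assumed reading
# of Table 69 notes 2 and 3).
VCC2_V = 2.3
VCC3_V = 3.6


def supply_budget_w(core_mhz):
    """Return (core watts, I/O watts) the supplies must be able to deliver."""
    return (VCC2_V * ICC2_MAX_A[core_mhz], VCC3_V * ICC3_MAX_A[core_mhz])


for mhz in sorted(ICC2_MAX_A):
    core_w, io_w = supply_budget_w(mhz)
    print(f"{mhz} MHz: core {core_w:.2f} W, I/O {io_w:.2f} W")
```

A real design would add margin on top of these figures for regulator tolerance and transient load.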
Power Dissipation



Table 70 contains the typical and maximum power dissipation 
of the processor during normal and reduced power states. 



Table 70. Typical and Maximum Power Dissipation

Clock Control State              233 MHz (6)   250 MHz (7)   266 MHz (6)   300 MHz (8)   333 MHz (6)   Comments
Normal (Maximum Thermal Power)   13.50 W       13.85 W       14.70 W       17.20 W       19.00 W       Note 1,2
Normal (Typical Thermal Power)   8.10 W        8.30 W        8.85 W        10.35 W       11.40 W       Note 3
Stop Grant / Halt (Maximum)      2.46 W        2.47 W        2.48 W        2.50 W        2.52 W        Note 4
Stop Clock (Maximum)             2.25 W        2.25 W        2.25 W        2.25 W        2.25 W        Note 5

Notes:

1. The maximum power dissipated in the normal clock control state must be taken into account when designing a solution
for thermal dissipation for the AMD-K6 3D processor.

2. Maximum power is determined for the worst-case instruction sequence or function for the listed clock control states with
VCC2 = 2.2 V and VCC3 = 3.3 V.

3. Typical power is determined for the typical instruction sequences or functions associated with normal system operation with
VCC2 = 2.2 V and VCC3 = 3.3 V.

4. The CLK signal and the internal PLL are still running but most internal clocking has stopped.

5. The CLK signal, the internal PLL, and all internal clocking has stopped.

6. CLK frequency equals 66 MHz.

7. CLK frequency equals 100 MHz.

8. This specification applies to components using a CLK frequency of 66 MHz or 100 MHz.
15

I/O Buffer Characteristics 



Introduction 



This chapter contains data that is subject to change. For the 
latest I/O buffer characteristics, see the AMD website at 
www.amd.com/K6/k6docs/. 

All of the AMD-K6 3D processor inputs, outputs, and 
bidirectional buffers are implemented using a 3.3V buffer 
design. In addition, a subset of the processor I/O buffers includes
a second, higher drive strength option. These buffers can be 
configured to provide the higher drive strength for applications 
that place a heavier load on these I/O signals. 

AMD has developed two I/O buffer models that represent the 
characteristics of each of the two possible drive strength 
configurations supported by the processor. These two models 
are called the Standard I/O Model and the Strong I/O Model. 

AMD developed the two models to allow system designers to 
perform analog simulations of processor signals that interface 
with the system logic. Analog simulations are used to determine 
a signal's time of flight from source to destination and to ensure
that the system's signal quality requirements are met. Signal 
quality measurements include overshoot, undershoot, slope 
reversal, and ringing. 

Selectable Drive Strength 



The processor samples the BRDYC# input during the falling 
transition of RESET to configure the drive strength of A[20:3], 
ADS#, HITM# and W/R#. If BRDYC# is 0 during the fall of 
RESET, these particular outputs are configured using the 
higher drive strength. If BRDYC# is 1 during the fall of RESET,
the standard drive strength is selected for all I/O buffers. 

Table 71 shows the relationship between BRDYC# and the two 
available drive strengths — K6STD and K6STG. 

Table 71. A[20:3], ADS#, HITM#, and W/R# Strength Selection

Drive Strength          BRDYC#   I/O Buffer Name
Strength 1 (standard)   1        K6STD
Strength 2 (strong)     0        K6STG
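
The selection rule in Table 71 can be expressed as a lookup, which may be convenient when assigning IBIS buffer models to nets in a simulation script. This is a minimal sketch: the function name buffer_model and the signal naming ("A3" through "A20") are assumptions made for illustration.

```python
# Outputs whose drive strength is selectable: A[20:3], ADS#, HITM#, W/R#.
STRONG_OUTPUTS = {"ADS#", "HITM#", "W/R#"} | {f"A{n}" for n in range(3, 21)}


def buffer_model(signal, brdyc_at_reset_fall):
    """Return the I/O buffer name for a signal.

    brdyc_at_reset_fall is the level of BRDYC# (0 or 1) sampled during
    the falling transition of RESET.
    """
    if brdyc_at_reset_fall == 0 and signal in STRONG_OUTPUTS:
        return "K6STG"   # Strength 2 (strong)
    return "K6STD"       # Strength 1 (standard)


print(buffer_model("A20", 0))   # selectable output, strong option chosen
print(buffer_model("A21", 0))   # not in the selectable group
print(buffer_model("ADS#", 1))  # standard strength selected for all buffers
```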



I/O Buffer Model 



AMD provides models of the processor I/O buffers for system 
designers to use in board-level simulations. These I/O buffer 
models conform to the I/O Buffer Information Specification 
(IBIS), Version 2.1. The Standard I/O Model uses K6STD, the
standard I/O buffer representation, for all I/O buffers. The 
Strong I/O Model uses K6STG, the stronger I/O buffer 
representation for A[20:3], ADS#, HITM#, and W/R#, and uses 
K6STD for the remainder of the I/O buffers. 

Both I/O models contain voltage versus current (V/I) and 
voltage versus time (V/T) data tables for accurate modeling of 
I/O buffer behavior. 

The following list characterizes the properties of each I/O 
buffer model: 

■ All data tables contain minimum, typical, and maximum
values to allow for worst-case, typical, and best-case 
simulations, respectively. 

■ The pullup, pulldown, power clamp, and ground clamp
device V/I tables contain enough data points to accurately
represent the nonlinear nature of the V/I curves. In addition,
the voltage ranges provided in these tables extend beyond
the normal operating range of the processor for those
simulators that yield more accurate results based on this
wider range. Figure 89 and Figure 90 illustrate the
min/typ/max pulldown and pullup V/I curves for K6STD
between 0 V and 3.3 V.

■ The rising and falling ramp rates are specified. 

■ The min/typ/max VCC3 operating range is specified as
3.135 V, 3.3 V, and 3.6 V, respectively.

■ VIL = 0.8 V, VIH = 2.0 V, and VMEAS = 1.5 V

■ The R/L/C of the package is modeled. 

■ The capacitance of the silicon die is modeled. 

■ The model assumes the test load is 0 capacitance, resistance, 
inductance, and voltage. 




Figure 89. K6STD Pulldown V/I Curves



I/O Model Application Note 



For the AMD-K6 3D processor I/O Buffer IBIS Models and their
application, see the AMD website at www.amd.com/K6/k6docs/. 



I/O Buffer AC and DC Characteristics 



See Chapter 16, "Signal Switching Characteristics" on page 313 
for the processor AC timing specifications. 

See Chapter 14, "Electrical Data" on page 303 for the processor
DC specifications. 
Signal Switching Characteristics

Introduction 



This chapter contains data that is subject to change. For the
latest signal switching characteristics, see the AMD website at
www.amd.com/K6/k6docs/.

The AMD-K6 3D processor signal switching characteristics are 
presented in Table 72 through Table 81 starting on page 314. 
Valid delay, float, setup, and hold timing specifications are 
listed. These specifications are provided for the system 
designer to determine if the timings necessary for the processor
to interface with the system logic are met. Table 72 and Table 73 
contain the switching characteristics of the CLK input. Table 74 
through Table 77 contain the timings for the normal operation 
signals. Table 78 and Table 79 contain the timings for RESET 
and the configuration signals. Table 80 and Table 81 contain the 
timings for the test operation signals. 

All signal timings provided are: 

■ Measured between CLK, TCK, or RESET at 1.5 V and the 
corresponding signal at 1.5 V — this applies to input and 
output signals that are switching from Low to High, or from 
High to Low 

■ Based on input signals applied at a slew rate of 1 V/ns
between 0 V and 3 V (rising) and 3 V to 0 V (falling)

■ Valid within the operating ranges given in "Operating
Ranges" on page 303

■ Based on a load capacitance (CL) of 0 pF

CLK Switching Characteristics



Table 72 and Table 73 contain the switching characteristics of 
the CLK input to the processor for 100-MHz and 66-MHz bus
operation, respectively, as measured at the voltage levels 
indicated by Figure 91 on page 315. 

The CLK Period Stability specifies the variance (jitter) allowed 
between successive periods of the CLK input measured at 1.5 V. 
This parameter must be considered as one of the elements of 
clock skew between the processor and the system logic. 

Clock Switching Characteristics for 100-MHz Bus Operation

Table 72. CLK Switching Characteristics for 100-MHz Bus Operation

                                 Advance Data
Symbol   Parameter Description   Min       Max       Figure   Comments
         Frequency                         100 MHz            In Normal Mode
t1       CLK Period              10.0 ns             91       In Normal Mode
t2       CLK High Time           3.0 ns              91
t3       CLK Low Time            3.0 ns              91
t4       CLK Fall Time           0.15 ns   1.5 ns    91
t5       CLK Rise Time           0.15 ns   1.5 ns    91
         CLK Period Stability              ±250 ps            Note

Note:

Jitter frequency power spectrum peaking must occur at frequencies greater than (Frequency of CLK)/3 or less than 500 kHz.
Clock Switching Characteristics for 66-MHz Bus Operation

Table 73. CLK Switching Characteristics for 66-MHz Bus Operation

                                 Preliminary Data
Symbol   Parameter Description   Min        Max        Figure   Comments
         Frequency               33.3 MHz   66.6 MHz            In Normal Mode
t1       CLK Period              15.0 ns    30.0 ns    91       In Normal Mode
t2       CLK High Time           4.0 ns                91
t3       CLK Low Time            4.0 ns                91
t4       CLK Fall Time           0.15 ns    1.5 ns     91
t5       CLK Rise Time           0.15 ns    1.5 ns     91
         CLK Period Stability               ±250 ps             Note

Note:

Jitter frequency power spectrum peaking must occur at frequencies greater than (Frequency of CLK)/3 or less than 500 kHz.
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^'9P^I ^^^^^^^ 
Valid Delay, Float, Setup, and Hold Timings 



Valid delay and float timings are given for output signals during 
functional operation and are given relative to the rising edge of 
CLK. During boundary-scan testing, valid delay and float 
timings for output signals are with respect to the falling edge of 
TCK. The maximum valid delay timings are provided to allow a 
system designer to determine if setup times to the system logic 
can be met. Likewise, the minimum valid delay timings are used 
to analyze hold times to the system logic. 

The setup and hold time requirements for the processor input 
signals must be met by the system logic to assure the proper 
operation of the processor. The setup and hold timings during 
functional and boundary-scan test mode are given relative to 
the rising edge of CLK and TCK, respectively. 
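
The use of maximum valid delays for setup analysis and minimum valid delays for hold analysis can be sketched as a simple margin calculation. The following Python example is only an illustration: the function names are hypothetical, and the board flight times, clock skew, and the receiving device's setup and hold requirements are assumed values that a system designer would take from the board layout and the system logic data sheet. Only the ADS# valid delays (1.0 ns minimum, 4.0 ns maximum) and the 10.0 ns CLK period come from the 100-MHz tables in this chapter.

```python
def setup_margin(clk_period, valid_delay_max, flight_time_max,
                 receiver_setup, clk_skew):
    """Setup margin in ns at the system logic input.

    The signal must arrive (valid delay plus flight time) early enough
    before the next CLK edge to satisfy the receiver's setup time,
    after allowing for clock skew.
    """
    return clk_period - (valid_delay_max + flight_time_max
                         + receiver_setup + clk_skew)


def hold_margin(valid_delay_min, flight_time_min, receiver_hold, clk_skew):
    """Hold margin in ns at the system logic input.

    The earliest possible change of the signal must occur after the
    receiver's hold time has elapsed, after allowing for clock skew.
    """
    return valid_delay_min + flight_time_min - receiver_hold - clk_skew


# ADS# at 100 MHz: valid delay 1.0 ns (min) to 4.0 ns (max), CLK period 10.0 ns.
# Flight times, skew, and receiver requirements below are assumed values.
ads_setup = setup_margin(clk_period=10.0, valid_delay_max=4.0,
                         flight_time_max=1.0, receiver_setup=3.0, clk_skew=0.5)
ads_hold = hold_margin(valid_delay_min=1.0, flight_time_min=0.5,
                       receiver_hold=0.5, clk_skew=0.5)

print(f"ADS# setup margin: {ads_setup} ns")
print(f"ADS# hold margin:  {ads_hold} ns")
```

A negative result from either function indicates that the corresponding requirement of the system logic would not be met under the assumed conditions.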

Output Delay Timings for 100-MHz Bus Operation 

Table 74. Output Delay Timings for 100-MHz Bus Operation 

                                        Advance Data
Symbol  Parameter Description           Min        Max        Figure  Comments
t6      A[31:3] Valid Delay             1.1 ns     4.0 ns     93
t7      A[31:3] Float Delay             -          7.0 ns     94
t8      ADS# Valid Delay                1.0 ns     4.0 ns     93
t9      ADS# Float Delay                -          7.0 ns     94
t10     ADSC# Valid Delay               1.0 ns     4.0 ns     93
t11     ADSC# Float Delay               -          7.0 ns     94
t12     AP Valid Delay                  1.0 ns     5.5 ns     93
t13     AP Float Delay                  -          7.0 ns     94
t14     APCHK# Valid Delay              1.0 ns     4.5 ns     93
t15     BE[7:0]# Valid Delay            1.0 ns     4.0 ns     93
t16     BE[7:0]# Float Delay            -          7.0 ns     94
t17     BREQ Valid Delay                1.0 ns     4.0 ns     93
t18     CACHE# Valid Delay              1.0 ns     4.0 ns     93
t19     CACHE# Float Delay              -          7.0 ns     94
t20     D/C# Valid Delay                1.0 ns     4.0 ns     93
t21     D/C# Float Delay                -          7.0 ns     94
t22     D[63:0] Write Data Valid Delay  1.3 ns     4.5 ns     93
t23     D[63:0] Write Data Float Delay  -          7.0 ns     94
t24     DP[7:0] Write Data Valid Delay  1.3 ns     4.5 ns     93
t25     DP[7:0] Write Data Float Delay  -          7.0 ns     94
t26     FERR# Valid Delay               1.0 ns     4.5 ns     93
t27     HIT# Valid Delay                1.0 ns     4.0 ns     93
t28     HITM# Valid Delay               1.1 ns     4.0 ns     93
t29     HLDA Valid Delay                1.0 ns     4.0 ns     93
t30     LOCK# Valid Delay               1.1 ns     4.0 ns     93
t31     LOCK# Float Delay               -          7.0 ns     94
t32     M/IO# Valid Delay               1.0 ns     4.0 ns     93
t33     M/IO# Float Delay               -          7.0 ns     94
t34     PCD Valid Delay                 1.0 ns     4.0 ns     93
t35     PCD Float Delay                 -          7.0 ns     94
t36     PCHK# Valid Delay               1.0 ns     4.5 ns     93
t37     PWT Valid Delay                 1.0 ns     4.0 ns     93
t38     PWT Float Delay                 -          7.0 ns     94
t39     SCYC Valid Delay                1.0 ns     4.0 ns     93
t40     SCYC Float Delay                -          7.0 ns     94
t41     SMIACT# Valid Delay             1.0 ns     4.0 ns     93
t42     W/R# Valid Delay                1.0 ns     4.0 ns     93
t43     W/R# Float Delay                -          7.0 ns     94

Input Setup and Hold Timings for 100-MHz Bus Operation 

Table 75. Input Setup and Hold Timings for 100-MHz Bus Operation 

                                        Advance Data
Symbol  Parameter Description           Min        Max        Figure  Comments
t44     A[31:5] Setup Time              3.0 ns     -          95
t45     A[31:5] Hold Time               1.0 ns     -          95
t46     A20M# Setup Time                3.0 ns     -          95      Note 1
t47     A20M# Hold Time                 1.0 ns     -          95      Note 1
t48     AHOLD Setup Time                3.5 ns     -          95
t49     AHOLD Hold Time                 1.0 ns     -          95
t50     AP Setup Time                   1.7 ns     -          95
t51     AP Hold Time                    1.0 ns     -          95
t52     BOFF# Setup Time                3.5 ns     -          95
t53     BOFF# Hold Time                 1.0 ns     -          95
t54     BRDY# Setup Time                3.0 ns     -          95
t55     BRDY# Hold Time                 1.0 ns     -          95
t56     BRDYC# Setup Time               3.0 ns     -          95
t57     BRDYC# Hold Time                1.0 ns     -          95
t58     D[63:0] Read Data Setup Time    1.7 ns     -          95
t59     D[63:0] Read Data Hold Time     1.5 ns     -          95
t60     DP[7:0] Read Data Setup Time    1.7 ns     -          95
t61     DP[7:0] Read Data Hold Time     1.3 ns     -          95
t62     EADS# Setup Time                3.0 ns     -          95
t63     EADS# Hold Time                 1.0 ns     -          95
t64     EWBE# Setup Time                1.7 ns     -          95
t65     EWBE# Hold Time                 1.0 ns     -          95
t66     FLUSH# Setup Time               1.7 ns     -          95      Note 2
t67     FLUSH# Hold Time                1.0 ns     -          95      Note 2
t68     HOLD Setup Time                 1.7 ns     -          95
t69     HOLD Hold Time                  1.5 ns     -          95
t70     IGNNE# Setup Time               1.7 ns     -          95      Note 1
t71     IGNNE# Hold Time                1.0 ns     -          95      Note 1
t72     INIT Setup Time                 1.7 ns     -          95      Note 2
t73     INIT Hold Time                  1.0 ns     -          95      Note 2
t74     INTR Setup Time                 1.7 ns     -          95      Note 1
t75     INTR Hold Time                  1.0 ns     -          95      Note 1
t76     INV Setup Time                  1.7 ns     -          95
t77     INV Hold Time                   1.0 ns     -          95
t78     KEN# Setup Time                 3.0 ns     -          95
t79     KEN# Hold Time                  1.0 ns     -          95
t80     NA# Setup Time                  1.7 ns     -          95
t81     NA# Hold Time                   1.0 ns     -          95
t82     NMI Setup Time                  1.7 ns     -          95      Note 2
t83     NMI Hold Time                   1.0 ns     -          95      Note 2
t84     SMI# Setup Time                 1.7 ns     -          95      Note 2
t85     SMI# Hold Time                  1.0 ns     -          95      Note 2
t86     STPCLK# Setup Time              1.7 ns     -          95      Note 1
t87     STPCLK# Hold Time               1.0 ns     -          95      Note 1
t88     WB/WT# Setup Time               1.7 ns     -          95
t89     WB/WT# Hold Time                1.0 ns     -          95

Notes: 

1. These level-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and 
hold times must be met. If asserted asynchronously, they must be asserted for a minimum pulse width of two clocks. 

2. These edge-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and 
hold times must be met. If asserted asynchronously, they must have been negated at least two clocks prior to assertion and must 
remain asserted at least two clocks. 

Output Delay Timings for 66-MHz Bus Operation 

Table 76. Output Delay Timings for 66-MHz Bus Operation 

                                        Preliminary Data
Symbol  Parameter Description           Min        Max        Figure  Comments
t6      A[31:3] Valid Delay             1.1 ns     6.3 ns     93
t7      A[31:3] Float Delay             -          10.0 ns    94
t8      ADS# Valid Delay                1.0 ns     6.0 ns     93
t9      ADS# Float Delay                -          10.0 ns    94
t10     ADSC# Valid Delay               1.0 ns     7.0 ns     93
t11     ADSC# Float Delay               -          10.0 ns    94
t12     AP Valid Delay                  1.0 ns     8.5 ns     93
t13     AP Float Delay                  -          10.0 ns    94
t14     APCHK# Valid Delay              1.0 ns     8.3 ns     93
t15     BE[7:0]# Valid Delay            1.0 ns     7.0 ns     93
t16     BE[7:0]# Float Delay            -          10.0 ns    94
t17     BREQ Valid Delay                1.0 ns     8.0 ns     93
t18     CACHE# Valid Delay              1.0 ns     7.0 ns     93
t19     CACHE# Float Delay              -          10.0 ns    94
t20     D/C# Valid Delay                1.0 ns     7.0 ns     93
t21     D/C# Float Delay                -          10.0 ns    94
t22     D[63:0] Write Data Valid Delay  1.3 ns     7.5 ns     93
t23     D[63:0] Write Data Float Delay  -          10.0 ns    94
t24     DP[7:0] Write Data Valid Delay  1.3 ns     7.5 ns     93
t25     DP[7:0] Write Data Float Delay  -          10.0 ns    94
t26     FERR# Valid Delay               1.0 ns     8.3 ns     93
t27     HIT# Valid Delay                1.0 ns     6.8 ns     93
t28     HITM# Valid Delay               1.1 ns     6.0 ns     93
t29     HLDA Valid Delay                1.0 ns     6.8 ns     93
t30     LOCK# Valid Delay               1.1 ns     7.0 ns     93
t31     LOCK# Float Delay               -          10.0 ns    94
t32     M/IO# Valid Delay               1.0 ns     5.9 ns     93
t33     M/IO# Float Delay               -          10.0 ns    94
t34     PCD Valid Delay                 1.0 ns     7.0 ns     93
t35     PCD Float Delay                 -          10.0 ns    94
t36     PCHK# Valid Delay               1.0 ns     7.0 ns     93
t37     PWT Valid Delay                 1.0 ns     7.0 ns     93
t38     PWT Float Delay                 -          10.0 ns    94
t39     SCYC Valid Delay                1.0 ns     7.0 ns     93
t40     SCYC Float Delay                -          10.0 ns    94
t41     SMIACT# Valid Delay             1.0 ns     7.3 ns     93
t42     W/R# Valid Delay                1.0 ns     7.0 ns     93
t43     W/R# Float Delay                -          10.0 ns    94




Input Setup and Hold Timings for 66-MHz Bus Operation 

Table 77. Input Setup and Hold Timings for 66-MHz Bus Operation 

                                        Preliminary Data
Symbol  Parameter Description           Min        Max        Figure  Comments
t44     A[31:5] Setup Time              6.0 ns     -          95
t45     A[31:5] Hold Time               1.0 ns     -          95
t46     A20M# Setup Time                5.0 ns     -          95      Note 1
t47     A20M# Hold Time                 1.0 ns     -          95      Note 1
t48     AHOLD Setup Time                5.5 ns     -          95
t49     AHOLD Hold Time                 1.0 ns     -          95
t50     AP Setup Time                   5.0 ns     -          95
t51     AP Hold Time                    1.0 ns     -          95
t52     BOFF# Setup Time                5.5 ns     -          95
t53     BOFF# Hold Time                 1.0 ns     -          95
t54     BRDY# Setup Time                5.0 ns     -          95
t55     BRDY# Hold Time                 1.0 ns     -          95
t56     BRDYC# Setup Time               5.0 ns     -          95
t57     BRDYC# Hold Time                1.0 ns     -          95
t58     D[63:0] Read Data Setup Time    2.8 ns     -          95
t59     D[63:0] Read Data Hold Time     1.5 ns     -          95
t60     DP[7:0] Read Data Setup Time    2.8 ns     -          95
t61     DP[7:0] Read Data Hold Time     1.5 ns     -          95
t62     EADS# Setup Time                5.0 ns     -          95
t63     EADS# Hold Time                 1.0 ns     -          95
t64     EWBE# Setup Time                5.0 ns     -          95
t65     EWBE# Hold Time                 1.0 ns     -          95
t66     FLUSH# Setup Time               5.0 ns     -          95      Note 2
t67     FLUSH# Hold Time                1.0 ns     -          95      Note 2
t68     HOLD Setup Time                 5.0 ns     -          95
t69     HOLD Hold Time                  1.5 ns     -          95
t70     IGNNE# Setup Time               5.0 ns     -          95      Note 1
t71     IGNNE# Hold Time                1.0 ns     -          95      Note 1
t72     INIT Setup Time                 5.0 ns     -          95      Note 2
t73     INIT Hold Time                  1.0 ns     -          95      Note 2
t74     INTR Setup Time                 5.0 ns     -          95      Note 1
t75     INTR Hold Time                  1.0 ns     -          95      Note 1
t76     INV Setup Time                  5.0 ns     -          95
t77     INV Hold Time                   1.0 ns     -          95
t78     KEN# Setup Time                 5.0 ns     -          95
t79     KEN# Hold Time                  1.0 ns     -          95
t80     NA# Setup Time                  4.5 ns     -          95
t81     NA# Hold Time                   1.0 ns     -          95
t82     NMI Setup Time                  5.0 ns     -          95      Note 2
t83     NMI Hold Time                   1.0 ns     -          95      Note 2
t84     SMI# Setup Time                 5.0 ns     -          95      Note 2
t85     SMI# Hold Time                  1.0 ns     -          95      Note 2
t86     STPCLK# Setup Time              5.0 ns     -          95      Note 1
t87     STPCLK# Hold Time               1.0 ns     -          95      Note 1
t88     WB/WT# Setup Time               4.5 ns     -          95
t89     WB/WT# Hold Time                1.0 ns     -          95

Notes: 

1. These level-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and 
hold times must be met. If asserted asynchronously, they must be asserted for a minimum pulse width of two clocks. 

2. These edge-sensitive signals can be asserted synchronously or asynchronously. To be sampled on a specific clock edge, setup and 
hold times must be met. If asserted asynchronously, they must have been negated at least two clocks prior to assertion and must 
remain asserted at least two clocks. 







RESET and Test Signal Timing 

Table 78. RESET and Configuration Signals for 100-MHz Bus Operation 

                                                Advance Data
Symbol  Parameter Description                   Min        Max        Figure  Comments
t90     RESET Setup Time                        1.7 ns     -          96
t91     RESET Hold Time                         1.0 ns     -          96
t92     RESET Pulse Width, VCC and CLK Stable   15 clocks  -          96
t93     RESET Active After VCC and CLK Stable   1.0 ms     -          96
t94     BF[2:0] Setup Time                      1.0 ms     -          96      Note 3
t95     BF[2:0] Hold Time                       2 clocks   -          96      Note 3
t96     BRDYC# Hold Time                        1.0 ns     -          96      Note 4
t97     BRDYC# Setup Time                       2 clocks   -          96      Note 2
t98     BRDYC# Hold Time                        2 clocks   -          96      Note 2
t99     FLUSH# Setup Time                       1.7 ns     -          96      Note 1
t100    FLUSH# Hold Time                        1.0 ns     -          96      Note 1
t101    FLUSH# Setup Time                       2 clocks   -          96      Note 2
t102    FLUSH# Hold Time                        2 clocks   -          96      Note 2

Notes: 

1. To be sampled on a specific clock edge, setup and hold times must be met the clock edge before the clock edge on which RESET 
is sampled negated. 

2. If asserted asynchronously, these signals must meet a minimum setup and hold time of two clocks relative to the negation of 
RESET. 

3. BF[2:0] must meet a minimum setup time of 1.0 ms and a minimum hold time of two clocks relative to the negation of RESET. 

4. If RESET is driven synchronously, BRDYC# must meet the specified hold time relative to the negation of RESET. 




Table 79. RESET and Configuration Signals for 66-MHz Bus Operation 

                                                Preliminary Data
Symbol  Parameter Description                   Min        Max        Figure  Comments
t90     RESET Setup Time                        5.0 ns     -          96
t91     RESET Hold Time                         1.0 ns     -          96
t92     RESET Pulse Width, VCC and CLK Stable   15 clocks  -          96
t93     RESET Active After VCC and CLK Stable   1.0 ms     -          96
t94     BF[2:0] Setup Time                      1.0 ms     -          96      Note 3
t95     BF[2:0] Hold Time                       2 clocks   -          96      Note 3
t96     BRDYC# Hold Time                        1.0 ns     -          96      Note 4
t97     BRDYC# Setup Time                       2 clocks   -          96      Note 2
t98     BRDYC# Hold Time                        2 clocks   -          96      Note 2
t99     FLUSH# Setup Time                       5.0 ns     -          96      Note 1
t100    FLUSH# Hold Time                        1.0 ns     -          96      Note 1
t101    FLUSH# Setup Time                       2 clocks   -          96      Note 2
t102    FLUSH# Hold Time                        2 clocks   -          96      Note 2

Notes: 

1. To be sampled on a specific clock edge, setup and hold times must be met the clock edge before the clock edge on which RESET 
is sampled negated. 

2. If asserted asynchronously, these signals must meet a minimum setup and hold time of two clocks relative to the negation of 
RESET. 

3. BF[2:0] must meet a minimum setup time of 1.0 ms and a minimum hold time of two clocks relative to the negation of RESET. 

4. If RESET is driven synchronously, BRDYC# must meet the specified hold time relative to the negation of RESET. 




Table 80. TCK Waveform and TRST# Timing at 25 MHz 

                                        Preliminary Data
Symbol  Parameter Description           Min        Max        Figure  Comments
-       TCK Frequency                   -          25 MHz     97
t103    TCK Period                      40.0 ns    -          97
t104    TCK High Time                   14.0 ns    -          97
t105    TCK Low Time                    14.0 ns    -          97
t106    TCK Fall Time                   -          5.0 ns     97      Notes 1, 2
t107    TCK Rise Time                   -          5.0 ns     97      Notes 1, 2
t108    TRST# Pulse Width               30.0 ns    -          98      Asynchronous

Notes: 

1. Rise/Fall times can be increased by 1.0 ns for each 10 MHz that TCK is run below its maximum frequency of 25 MHz. 
2. Rise/Fall times are measured between 0.8 V and 2.0 V. 

Table 81. Test Signal Timing at 25 MHz 

                                            Preliminary Data
Symbol  Parameter Description               Min        Max        Figure  Notes
t109    TDI Setup Time                      5.0 ns     -          99      Note 2
t110    TDI Hold Time                       9.0 ns     -          99      Note 2
t111    TMS Setup Time                      5.0 ns     -          99      Note 2
t112    TMS Hold Time                       9.0 ns     -          99      Note 2
t113    TDO Valid Delay                     3.0 ns     13.0 ns    99      Note 1
t114    TDO Float Delay                     -          16.0 ns    99      Note 1
t115    All Outputs (Non-Test) Valid Delay  3.0 ns     13.0 ns    99      Note 1
t116    All Outputs (Non-Test) Float Delay  -          16.0 ns    99      Note 1
t117    All Inputs (Non-Test) Setup Time    5.0 ns     -          99      Note 2
t118    All Inputs (Non-Test) Hold Time     9.0 ns     -          99      Note 2

Notes: 

1. Parameter is measured from the TCK falling edge. 
2. Parameter is measured from the TCK rising edge. 




INPUTS                                OUTPUTS
Must be steady                        Steady
Can change from High to Low           Changing from High to Low
Can change from Low to High           Changing from Low to High
Don't care, any change permitted      Changing, State Unknown
(Does not apply)                      Center line is high impedance state

Figure 92. Diagrams Key 

(Waveform diagram: CLK and an output signal, showing the minimum and maximum valid delay tv between Valid n and Valid n+1.) 

v = 6, 8, 10, 12, 14, 15, 17, 18, 20, 22, 24, 26, 27, 28, 29, 30, 32, 34, 36, 37, 39, 41, 42 

Figure 93. Output Valid Delay Timing 
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Figure 94. Maximum Float Delay Timing 



(Waveform diagram: CLK and an input signal, showing the setup time ts before and the hold time th after the rising edge of CLK.) 

s = 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88 
h = 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89 

Figure 95. Input Setup and Hold Timing 
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Signal Switching Characteristics 



(Timing diagram: CLK, RESET, FLUSH# (synchronous), FLUSH# and BRDYC# (asynchronous), and BF[2:0] (asynchronous), annotated with the RESET and configuration timing symbols.) 

Figure 96. Reset and Configuration Timing 
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17 

Thermal Design 



This chapter contains thermal data that is subject to change. 
For the latest values, see the AMD website at 
www.amd.com/K6/k6docs/. 

Package Thermal Specifications 



The AMD-K6 3D processor operating specification calls for the 
case temperature (T_C) to be in the range of 0°C to 70°C. The 
ambient temperature (T_A) is not specified as long as the case 
temperature is not violated. The case temperature must be 
measured on the top center of the package. Table 82 shows the 
processor thermal specifications. 



Table 82. Package Thermal Specification 

T_C Case Temperature      0°C-70°C
θ_JC Junction-Case        1.7°C/W

                          233 MHz   250 MHz   266 MHz   300 MHz   333 MHz
Maximum Thermal Power     13.50 W   13.85 W   14.70 W   17.20 W   19.00 W
Stop Grant Mode           2.46 W    2.47 W    2.48 W    2.50 W    2.52 W
Stop Clock Mode           2.25 W    2.25 W    2.25 W    2.25 W    2.25 W




Figure 100 shows the thermal model of a processor with a 
passive thermal solution. The case-to-ambient temperature 
(T_CA) can be calculated from the following equation: 

$$T_{CA} = P_{MAX} \cdot \theta_{CA} = P_{MAX} \cdot (\theta_{IF} + \theta_{SA})$$

Where: 

P_MAX   Maximum Power Consumption 
θ_CA    Case-to-Ambient Thermal Resistance 
θ_IF    Interface Material Thermal Resistance 
θ_SA    Sink-to-Ambient Thermal Resistance 

(Diagram: temperature from case to ambient and the corresponding thermal resistances in °C/W.) 

Figure 100. Thermal Model 

Figure 101 on page 333 illustrates the case-to-ambient 
temperature (T_CA) in relation to the power consumption 
(X-axis) and the thermal resistance (Y-axis). If the power 
consumption and case temperature are known, the thermal 
resistance (θ_CA) requirement can be calculated for a given 
ambient temperature (T_A) value. 

(Plot: X-axis is Power Consumption (Watts), from 10 W to 20 W; Y-axis is thermal resistance.) 

Figure 101. Power Consumption vs. Thermal Resistance 



The following example calculates the required thermal 
resistance of a heatsink: 

If: 

T_C = 70°C 
T_A = 45°C 
P_MAX = 19.0 W at 333 MHz 

Then: 

$$\theta_{CA} = \frac{T_C - T_A}{P_{MAX}} = \frac{25\,^{\circ}\mathrm{C}}{19.0\ \mathrm{W}} = 1.32\ (^{\circ}\mathrm{C/W})$$

Thermal grease is recommended as interface material because 
it provides the lowest thermal resistance (= 0.20°C/W). The 
required thermal resistance (θ_SA) of the heatsink in this 
example is calculated as follows: 

$$\theta_{SA} = \theta_{CA} - \theta_{IF} = 1.32 - 0.20 = 1.12\ (^{\circ}\mathrm{C/W})$$
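
The heatsink calculation above can also be expressed as a short Python sketch, which makes it easy to repeat for other clock speeds or ambient temperatures. The function and parameter names are illustrative assumptions; the input values are the ones used in the worked example.

```python
def required_heatsink_resistance(t_case_max, t_ambient, p_max, theta_if):
    """Return (theta_ca, theta_sa) in degrees C per watt.

    theta_ca: required case-to-ambient thermal resistance
    theta_sa: required sink-to-ambient (heatsink) thermal resistance,
              after subtracting the interface material resistance
    """
    theta_ca = (t_case_max - t_ambient) / p_max
    theta_sa = theta_ca - theta_if
    return theta_ca, theta_sa


# Values from the worked example: 70 C case limit, 45 C ambient,
# 19.0 W at 333 MHz, thermal grease at 0.20 C/W.
theta_ca, theta_sa = required_heatsink_resistance(70.0, 45.0, 19.0, 0.20)
print(f"theta_CA = {theta_ca:.2f} C/W")
print(f"theta_SA = {theta_sa:.2f} C/W")
```

The sketch keeps the unrounded intermediate value of the case-to-ambient resistance, so for other inputs the last digit can differ slightly from a hand calculation that rounds at each step.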
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Heat Dissipation Path 



Figure 102 illustrates the processor's heat dissipation path. 
Most of the heat generated by the processor is dissipated from 
the top surface (ceramic and lid) of the package. The small 
amount of heat generated from the bottom side of the processor 
where the processor socket blocks the convection can be safely 
ignored. 



The case temperature must be measured on the top center of 
the package where most of the heat is dissipated. Figure 103 
shows the correct location for measuring the case temperature. 
(If a heat exchange device is installed, the thermocouple must 
contact the processor top surface through a drilled hole.) The 
case temperature is measured to ensure that the thermal 
solution meets the operational specification. 



(Diagram labels: Ambient Temperature, Thin Lid) 

Figure 102. Processor Heat Dissipation Path 

Measuring Case Temperature 

(Diagram label: Thermocouple) 

Figure 103. Measuring Case Temperature 
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Layout and Airflow Considerations 



Voltage Regulator 

A voltage regulator is required to support the lower voltage 
(3.3 V and lower) to the processor. In most applications, the 
voltage regulator is designed with power transistors. As a 
result, additional heatsinks are required to dissipate the heat 
from the power transistors. Figure 104 shows the voltage 
regulator placed parallel to the processor with the airflow 
aligned with the devices. With this alignment, the heat 
generated by the voltage regulator has minimal effect on the 
processor. 




Figure 104. Voltage Regulator Placement 

A heatsink and fan combination can deliver much better 
thermal performance than a heatsink alone. More importantly, 
with a fan/sink the airflow requirements in a system design are 
not as critical. A unidirectional heatsink with a fan moves air 
from the top of the heatsink to the side. In this case, the best 
location for the voltage regulator is on the side of the processor 
in the path of the airflow exiting the fan sink (see Figure 105 on 
page 336). This location guarantees that the heatsinks on both 
the processor and the regulator receive adequate air 
circulation. 
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Thermal Design 



(Diagram labels: Airflow, Ideal areas for voltage regulator) 

Figure 105. Airflow for a Heatsink with Fan 



Airflow Management in a System Design 



Complete airflow management in a system is important. In 
addition to the volume of air, the path of the air is also 
important. Figure 106 shows the airflow in a dual-fan system. 
The fan in the front end pulls cool air into the system through 
intake slots in the chassis. The power supply fan forces the hot 
air out of the chassis. The thermal performance of the heatsink 
can be maximized if it is located in the shaded area, where it 
receives greatest benefit from this air exchange system. 




Figure 106. Airflow Path in a Dual-Fan System 
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Figure 107 shows the airflow management in a system using the 
ATX form-factor. The orientation of the power supply fan and 
the motherboard are modified in the ATX platform design. The 
power supply fan pulls cool air through the chassis and across 
the processor. The processor is located near the power supply 
fan, where it can receive adequate airflow without an auxiliary 
fan. The arrangement significantly improves the airflow across 
the processor with minimum installation cost. 




Figure 107. Airflow Path in an ATX Form-Factor System 
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Pins and Packaging 



Introduction 



This chapter contains information about the AMD-K6 3D 
processor pin grid array, pin designations, and packaging. Pin 
placement is shown in a top-side view in Figure 108 on page 340 
and in a pin-side view in Figure 109 on page 341. Table 83 on 
page 342 organizes the pins by functional grouping. Table 84 on 
page 343 gives the package specifications, which are illustrated 
in Figure 110 on page 344. 

This chapter contains packaging information that is subject to 
change. For the latest information, see the AMD website at 
www.amd.com/K6/k6docs/. 

(Pin grid diagram, with positions numbered 1 through 37 and lettered from A. Legend: Control/Parity Pins, Address Pins, VSS Pins, Test Pins, NC and INC (Internal No Connect) Pins, RSVD (Reserved) Pins, Data Pins, Chip Positioning Key Pin.) 
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Figure 108. Processor Top-Side View 
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Figure 109. Processor Pin-Side View 
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Pin Designations 



Table 83. Processor Functional Grouping 



Address 


Data 


Control 


Test 


NC 


Vol 




Vs5 


Pin 


Pin 


Pin 


Pin 


Pin 


Pin 


Pin 


Pin 


Pill 


Pin 


Pin 


Pta 


Name 


Nn 
no. 


no me 


Ma 

no. 


mine 


no. 


iwne 


NO* 


Na 


No. 


No. 


No. 


A3 


AL-3S 


DO 


K'54 


A20liA# 


AK-OB 


Ta 


M-34 


A-37 


Ar07 


A-19 


A AT Aii ^/l 


M 


AM-54 


Dl 


C-55 


ADS# 


AM}5 


TOI 


N-35 


E-17 


A-09 


A-21 


B-06 AM-22 


AS 


AK-32 


D2 


J-35 


A05C# 


AM-02 


TOO 


N-33 


E-25 


A-11 


A-23 


IHJo AM-24 


A6 


AN-33 


03 


C-33 


AHOLD 


V-04 
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R-34 
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A-25 


O in Alt 

B-IO AM-26 


A7 
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04 
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AE05 
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Q-33 
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B-I2 AM-28 


AS 
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05 
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BEO» 
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.. 
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Or] A AM-JO 


A9 
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06 
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BE1# 


AK-10 






W-33 


B-02 
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D If AIM Xt 

D-16 nlv-37 
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07 
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P ackage Sp ecifications 

321-Pin Staggered CPGA Package Specification 

Table 84. 321-Pin Staggered CPGA Package Specification 

          Millimeters                   Inches
Symbol    Min      Max      Notes       Min      Max      Notes
A         49.28    49.78    -           1.940    1.960    -
B         45.59    45.85    -           1.795    1.805    -
C         31.32    32.59    -           1.233    1.283    -
D         44.90    45.10    -           1.768    1.776    -
E         2.91     3.63     -           0.115    0.143    -
F         1.30     1.52     -           0.051    0.060    -
G         3.05     3.30     -           0.120    0.130    -
H         0.43     0.51     -           0.017    0.020    -
M         2.29     2.79     -           0.090    0.110    -
N         1.14     1.40     -           0.045    0.055    -
d         1.52     2.29     -           0.060    0.090    -
e         1.52     2.54     -           0.060    0.100    -
f         -        0.13     Flatness    -        0.005    Flatness
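
Because the package table gives every dimension in both millimeters and inches, the two columns can be cross-checked against each other. The following Python sketch shows one way to do this for a few rows; the function name, the selected rows, and the 0.001-inch tolerance are assumptions chosen for illustration.

```python
MM_PER_INCH = 25.4


def matches(mm, inches, tolerance=0.001):
    """True if a millimeter value agrees with its inch value within tolerance (inches)."""
    return abs(mm / MM_PER_INCH - inches) <= tolerance


# (symbol, mm min, mm max, inch min, inch max) for three rows of Table 84.
rows = [
    ("A", 49.28, 49.78, 1.940, 1.960),
    ("B", 45.59, 45.85, 1.795, 1.805),
    ("D", 44.90, 45.10, 1.768, 1.776),
]

for symbol, mm_min, mm_max, in_min, in_max in rows:
    ok = matches(mm_min, in_min) and matches(mm_max, in_max)
    print(symbol, "consistent" if ok else "check values")
```

A row reported as inconsistent usually points to a transcription error in one of the two unit columns.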
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19 

Ordering Information 



The ordering information contained in this chapter is subject to 
change. For the latest information, see the AMD website at 
www.amd.com/k6/k6docs/. 
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Ordering Information 



Standard AMD-K6 3D Processor Model 8 Products 

AMD standard products are available in several operating ranges. The ordering part 
number (OPN) is formed by a combination of the elements below. 

AMD-K6 3D /300 A F R 

Family/Core           AMD-K6 
Performance Rating    /333, /300, /266, /250, /233 
Package Type          A = 321-pin CPGA 
Operating Voltage     F = 2.1 V-2.3 V (Core) / 3.135 V-3.6 V (I/O) 
Case Temperature      R = 0°C-70°C 



Table 85. Valid Ordering Part Number Combinations 

OPN                 Package Type    Operating Voltage                          Case Temperature
AMD-K6 3D/333AFR    321-pin CPGA    2.1 V-2.3 V (Core), 3.135 V-3.6 V (I/O)    0°C-70°C
AMD-K6 3D/300AFR    321-pin CPGA    2.1 V-2.3 V (Core), 3.135 V-3.6 V (I/O)    0°C-70°C
AMD-K6 3D/266AFR    321-pin CPGA    2.1 V-2.3 V (Core), 3.135 V-3.6 V (I/O)    0°C-70°C
AMD-K6 3D/250AFR    321-pin CPGA    2.1 V-2.3 V (Core), 3.135 V-3.6 V (I/O)    0°C-70°C
AMD-K6 3D/233AFR    321-pin CPGA    2.1 V-2.3 V (Core), 3.135 V-3.6 V (I/O)    0°C-70°C

Note: 

This table lists configurations planned to be supported in volume for this device. Consult the local 
AMD sales office to confirm availability of specific valid combinations and to check on 
newly-released combinations. 
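
As an illustration of how the OPN elements combine, the following Python sketch splits an ordering part number into its fields and looks up each code. It is not an AMD tool: the function name, the regular expression, and the dictionary layout are assumptions, and the meaning of the final letter (R) is inferred from the case temperature column of Table 85.

```python
import re

# Field codes as listed for the OPN elements. The "R" case temperature
# code is inferred from the rows of Table 85.
PACKAGE_TYPES = {"A": "321-pin CPGA"}
OPERATING_VOLTAGES = {"F": "2.1 V-2.3 V (Core) / 3.135 V-3.6 V (I/O)"}
CASE_TEMPERATURES = {"R": "0°C-70°C"}
PERFORMANCE_RATINGS = {"233", "250", "266", "300", "333"}

# Accepts "AMD-K6 3D/300AFR" with or without the spaces around "3D".
_OPN = re.compile(r"^AMD-K6 ?3D ?/(\d{3})([A-Z])([A-Z])([A-Z])$")


def parse_opn(opn):
    """Split an OPN string into its fields, raising ValueError if it is not valid."""
    match = _OPN.match(opn.strip())
    if match is None:
        raise ValueError(f"unrecognized OPN format: {opn!r}")
    rating, package, voltage, temperature = match.groups()
    if rating not in PERFORMANCE_RATINGS:
        raise ValueError(f"unknown performance rating /{rating} in {opn!r}")
    try:
        return {
            "performance_rating": rating,
            "package": PACKAGE_TYPES[package],
            "operating_voltage": OPERATING_VOLTAGES[voltage],
            "case_temperature": CASE_TEMPERATURES[temperature],
        }
    except KeyError as exc:
        raise ValueError(f"unknown field code {exc.args[0]!r} in {opn!r}") from None


print(parse_opn("AMD-K6 3D/300AFR"))
```

Keeping the codes in dictionaries means a newly released combination only requires a new dictionary entry rather than a change to the parsing logic.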
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Appendix A 



MMX Multimedia 
Technology 



Introduction 



MMX multimedia technology was originally introduced in the 
AMD-K6 processor, before the development of 3D technology. 
The AMD-K6 3D processor executes both MMX and 3D 
instruction sets. References to multimedia technology in this 
appendix pertain to the MMX multimedia technology. For 
information on 3D technology, see Chapter 4, "3D Technology" 
on page 81. References to the AMD-K6 3D processor in this
appendix also apply to the AMD-K6 processor. 

PC performance requirements are being driven by emerging 
multimedia and communications software. 3D graphics, video, 
audio, and telephony capabilities are evolving across education, 
entertainment, and internet applications. As multimedia 
applications continue to proliferate in the marketplace, PC 
systems suppliers are being challenged to deliver 
multimedia-enabled PC solutions at reasonable prices. 

The AMD-K6 3D processor incorporates a robust multimedia 
technology that is fully software compatible with the MMX 
technology as defined by Intel. This multimedia technology 
enables scaleable multimedia capabilities across a broad range 
of PC systems. 

The processor features a decode-decoupled superscalar 
microarchitecture and state-of-the-art design techniques to 
deliver true sixth-generation performance while maintaining 
full x86 binary software compatibility. An x86 
binary-compatible processor implements the industry-standard 
x86 instruction set by decoding and executing the x86 
instruction set as its native mode of operation. Only this native 
mode enables delivery of maximum performance when running 
PC software. 

The processor delivers leading-edge performance to 
mainstream PC systems running industry-standard x86
software. The processor implements advanced design 
techniques like instruction pre-decoding, dual x86 opcode 
decoding, single-cycle internal RISC operations, parallel 
execution units, out-of-order execution, data forwarding, 
register renaming, and dynamic branch prediction. In other 
words, the AMD-K6 3D processor is capable of issuing, 
executing, and retiring multiple x86 instructions per cycle, 
resulting in superior scaleable performance. 

This appendix describes the multimedia technology of the 
processor, including data types, instructions, and programming
considerations. 

MMX Multimedia Technology Architecture



The multimedia technology in the processor is designed to 
accelerate media and communication applications. Specialized
applications that use music synthesis, speech synthesis, speech 
recognition, audio and video compression and decompression, 
full motion video, 2D and 3D graphics, and video conferencing, 
can take advantage of the AMD-K6 3D processor multimedia 
technology. The multimedia technology implements new 
instructions, new data types, and powerful parallel processing 
(Single Instruction Multiple Data — SIMD) techniques that can 
significantly increase the performance of these applications. 

Key Functionality 

At the lowest levels, multimedia applications (audio, video, 3D 
graphics, and telephony, etc.) contain many similar functions. 
When these functions are performed on a processor that does 
not have MMX capability, the processor is heavily burdened by 
the computational requirements of this information. Processors 
executing the MMX instructions increase the performance of 
multimedia applications. This performance increase is a direct 
result of the increased multimedia bandwidth of the processor. 

Multimedia applications must process large amounts of data. 
Parallel data computing is exemplified by applications that 
manipulate screen pixel information. Instead of acting on one 
pixel at a time, multimedia technology enables the system to 
act on multiple pixels simultaneously. This SIMD model is a key 
feature of MMX technology. 

The AMD-K6 3D processor multimedia technology architecture 
includes four new MMX data types, 57 new MMX instructions, 
eight new 64-bit MMX registers, and an SIMD processing 
pipeline. The multimedia technology is compatible with 
existing x86 applications. 

The 57 new MMX instructions include arithmetic functions, 
packing and unpacking functions, logical operations, and 
moves. These are the basic functions that are most commonly 
used in repetitive computational multimedia programs. 

Multimedia applications often use smaller operands — 8-bit data 
is commonly used for pixel information and 16-bit data is used 
for audio samples. The new MMX registers allow data to be 
packed into 64-bit operands. For example, 8-bit data (1 byte) 
can be packed in sets of eight in a single 64-bit register, and all 
eight bytes can be operated on simultaneously by a single MMX 
instruction. 

For 256-color video modes, this translates to computing eight 
pixels per instruction. When an entire screen is being re-drawn, 
these pixel manipulation routines often use highly repetitive 
loops. Parallel processing of eight pieces of data can reduce the 
processing time of a code loop by up to a factor of eight. 

Multimedia applications frequently multiply and accumulate 
data. The multimedia technology provides instructions that 
add, multiply, and even combine these operations. For example, 
the PMADDWD instruction can multiply and then add words of 
data in a single instruction that uses far fewer processor cycles
than the equivalent x86 operations. 
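
As a rough illustration of the multiply-and-accumulate behavior, the following Python sketch models PMADDWD on 64-bit operands held as plain integers. It assumes the commonly documented semantics of the instruction (four signed 16-bit products, with adjacent products summed into two 32-bit results). The helper names are illustrative only and are not part of any AMD tool.

```python
def _words(q):
    """Split a 64-bit value into four signed 16-bit words, lowest word first."""
    out = []
    for i in range(4):
        w = (q >> (16 * i)) & 0xFFFF
        out.append(w - 0x10000 if w & 0x8000 else w)
    return out


def pmaddwd(dst, src):
    """Model of PMADDWD: multiply matching words, then add adjacent products."""
    a, b = _words(dst), _words(src)
    lo = (a[0] * b[0] + a[1] * b[1]) & 0xFFFFFFFF  # result bits 31-0
    hi = (a[2] * b[2] + a[3] * b[3]) & 0xFFFFFFFF  # result bits 63-32
    return (hi << 32) | lo
```

With the words 1, 2, 3, 4 in one operand and 5, 6, 7, 8 in the other, the model yields 1*5 + 2*6 in the low doubleword and 3*7 + 4*8 in the high doubleword, which is the pattern a dot-product loop would otherwise spend several multiply and add instructions on.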

Executing MMX Instructions

A programmer must approach the use of MMX instructions
differently, based on whether the code being developed is at
the system level or at the application level. The details of these
differences are discussed in "MMX Programming 
Considerations" on page 355. 

Before using the MMX instructions, the programmer must use 
the CPUID instruction to determine if the processor supports 
multimedia technology. For more information, see Appendix C,
"AMD Processor Recognition" on page 505. 

Function 1 (EAX=1) of the processor CPUID instruction returns 
the processor feature bits in the EDX register. Software can
then test bit 23 of the feature bits to determine if the processor 
supports the multimedia technology. If bit 23 is set to 1, MMX 
instructions are supported. All AMD-K6 3D processors have bit 
23 set. Once it is determined that multimedia technology is 
supported, subsequent code can use the MMX instructions. 
Alternatively, the AMD 8000_0001h extended CPUID function 
can be used to test whether the processor supports multimedia 
technology. 
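
The bit test itself is simple enough to sketch in Python. The function below does not execute CPUID; it assumes the caller has already obtained the EDX value returned by function 1 through some other means. The names `MMX_FEATURE_BIT` and `supports_mmx` are hypothetical and chosen for this example.

```python
MMX_FEATURE_BIT = 23  # bit position of the MMX flag in EDX, per the text above


def supports_mmx(edx):
    """Return True if bit 23 of the CPUID feature flags (EDX) is set."""
    return bool((edx >> MMX_FEATURE_BIT) & 1)
```

The corresponding mask is `1 << 23`, which is 800000h, so the same check in assembly is a `test` of EDX against that constant.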

After a module of MMX code has executed, the programmer 
must empty the MMX state by executing the EMMS command. 
Because the MMX registers share the floating-point registers, 
an instruction is needed to prevent MMX code from interfering 
with floating-point. The EMMS command clears the multimedia 
state and resets all the floating-point tag bits. Emptying the 
MMX state sets the floating-point tag bits to empty (all 1s),
which marks the MMX/FP registers as invalid and available. 

MMX Register Set 

The AMD-K6 3D processor implements eight new 64-bit MMX 
registers. These registers are mapped on the floating-point 
registers. As shown in Figure 111 on page 351, the new MMX 
instructions refer to these registers as mm0 to mm7. Mapping
the new MMX registers on the floating-point stack enables 
backwards compatibility for the register saving that must occur 
as a result of task switching. 

Figure 111. MMX Registers

Aliasing the MMX registers onto the floating-point stack 
registers provides a safe way to introduce this new technology. 
Instead of needing to modify operating systems, new MMX 
applications can be supported through device drivers, MMX 
libraries, or DLL files. See "MMX Programming 
Considerations" on page 355 for more information. 

Current operating systems have support for floating-point 
operations. Using the floating-point registers for MMX code is 
an ingenious way of implementing automatic support for MMX 
instructions. Every time the processor executes an MMX 
instruction, all the floating-point register tag bits are set to zero 
(00b = valid). Setting the tag bits after every MMX instruction
prevents the processor from having to perform extra tasks. 
These extra tasks are normally executed on floating-point 
registers when the Tag field is something other than 00b. 
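
The effect of MMX instructions and EMMS on the tag bits can be modeled in a few lines of Python. This is only a toy model: it assumes the conventional 16-bit tag word with one 2-bit field per register, treats every MMX instruction other than EMMS identically, and starts from an all-empty state, which is an assumption of the sketch rather than something this appendix states.

```python
class TagWordModel:
    """Toy model of the 16-bit floating-point tag word (2 bits per register)."""

    VALID_ALL = 0x0000  # every tag field 00b
    EMPTY_ALL = 0xFFFF  # every tag field 11b

    def __init__(self):
        self.tag_word = self.EMPTY_ALL  # assumed starting state: all empty

    def mmx_instruction(self):
        """Any MMX instruction other than EMMS marks all eight registers valid."""
        self.tag_word = self.VALID_ALL

    def emms(self):
        """EMMS marks all eight registers empty."""
        self.tag_word = self.EMPTY_ALL

    def tag(self, reg):
        """Return the 2-bit tag field for register 0-7."""
        return (self.tag_word >> (2 * reg)) & 0b11
```

Calling `mmx_instruction()` and then `emms()` on an instance moves the tag word from 0000h to FFFFh, which is the transition that makes the registers available to floating-point code again.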

If a task switch occurs during an MMX or floating-point 
instruction, the Control Register (CR0) Task Switch (TS) bit is
set to 1. The processor then generates an interrupt 7 (int 7 
Device Not Available) when it encounters the next 
floating-point or MMX instruction, allowing the operating 
system to save the state of the MMX/FP registers. 

If there is a task switch when MMX applications are running
with older applications that do not include MMX instructions, 
the MMX/FP register state is still saved automatically through 
the int 7 handler. 



MMX Data Type Details 



The processor multimedia technology uses a packed data 
format. The data is packed in a single, 64-bit MMX register or 
memory operand as eight bytes, four words, or two
doublewords. Each byte, word, doubleword, or quadword is an integer
data type. 

The form of an instruction determines the data type. For 
example, the MOV instruction comes in two different forms — 
MOVD moves 32 bits of data and MOVQ moves 64 bits of data. 

The four new data types are defined as follows: 



Packed byte          Eight 8-bit bytes packed into 64 bits
                     Signed integer range (-2^7 to 2^7-1)
                     Unsigned integer range (0 to 2^8-1)

Packed word          Four 16-bit words packed into 64 bits
                     Signed integer range (-2^15 to 2^15-1)
                     Unsigned integer range (0 to 2^16-1)

Packed doubleword    Two 32-bit doublewords packed into 64 bits
                     Signed integer range (-2^31 to 2^31-1)
                     Unsigned integer range (0 to 2^32-1)

Quadword             One 64-bit quadword
                     Signed integer range (-2^63 to 2^63-1)
                     Unsigned integer range (0 to 2^64-1)



Figure 112 on page 353 shows the four new data types. 
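
The element counts and ranges listed above all follow from the element width. A minimal Python sketch, with a hypothetical helper name, makes the arithmetic explicit:

```python
def packed_ranges(element_bits):
    """Return element count and value ranges for a 64-bit packed operand."""
    count = 64 // element_bits
    signed = (-(1 << (element_bits - 1)), (1 << (element_bits - 1)) - 1)
    unsigned = (0, (1 << element_bits) - 1)
    return count, signed, unsigned
```

For example, `packed_ranges(8)` describes the packed byte type: eight elements, each in -128 to 127 when signed or 0 to 255 when unsigned.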

Figure 112. MMX Data Types



MMX Instructions 



The processor multimedia technology includes 57 new MMX 
instructions. These instructions are organized into the following 
groups: 

■ Arithmetic 

■ Empty MMX registers 

■ Compare 

■ Convert (pack/unpack) 

■ Logical 

■ Move 

■ Shift 

The following mnemonics are used in the instructions: 

■ P: Packed data

■ B: Byte

■ W: Word

■ D: Doubleword

■ Q: Quadword

■ S: Signed

■ U: Unsigned

■ SS: Signed Saturation

■ US: Unsigned Saturation

For example, the mnemonic for the PACK instruction that packs 
four words into eight unsigned bytes is PACKUSWB. In this 
mnemonic, the US designates an unsigned result with 
saturation, and the WB means that the source is packed words 
and the result is packed bytes. 

The term saturation is commonly used in multimedia 
applications. Saturation allows mathematical limits to be 
placed on the data elements. If a result exceeds the boundary of 
that data type, the result is set to the defined limit for that 
instruction. A common use of saturation is to prevent color 
wraparound. 
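
The difference between wraparound and saturation is easiest to see on pixel data. The Python sketch below adds two lists of unsigned bytes lane by lane, once with wraparound and once with unsigned saturation. It is a model of the arithmetic only, written with illustrative function names, and does not represent any particular instruction encoding.

```python
def add_bytes_wrap(a, b):
    """Add unsigned bytes lane by lane, wrapping modulo 256."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_bytes_saturate(a, b):
    """Add unsigned bytes lane by lane, clamping each sum to 255."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]
```

Brightening a nearly white pixel value of 250 by 10 wraps to 4 (almost black) in the first function but clamps to 255 in the second, which is the color wraparound that saturation is meant to prevent.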



MMX Instruction Formats 

All MMX instructions, except the EMMS instruction that uses 
no operands, are formatted as follows: 

INSTRUCTION mmreg1, mmreg2/mem64

The source operand (mmreg2/mem64) can be either an MMX
register or a memory location. The destination operand
(mmreg1) can only be an MMX register.

The MOVD and MOVQ instructions also have the following 
acceptable formats: 

MOVD mmreg1, mreg32/mem32

MOVD mreg32/mem32, mmreg1

MOVQ mem64, mmreg1

In the first example, the source operand (mreg32/mem32) can
be either an integer register or a 32-bit memory address. The
destination operand (mmreg1) can only be an MMX register.
The second example has the source operand as an MMX
register. The destination operand (mreg32/mem32) can be
either an integer register or a 32-bit memory address. The third
example has the source operand as an MMX register and the
destination operand as a 64-bit memory location.

The SHIFT instructions can also utilize an immediate source
operand. It is designated as imm8.

PSRLW mmreg1, imm8

MMX Programming Considerations



This section describes considerations for programmers writing 
operating systems, compilers, and applications that utilize 
MMX instructions as implemented in the AMD-K6 3D 
processor. 



MMX Feature Detection

To use the AMD-K6 3D processor MMX multimedia technology,
the programmer must determine if the processor supports it.
The CPUID instruction gives programmers the ability to
determine the presence of multimedia technology on the
processor. Software must first test to see if the CPUID
instruction is supported. For a detailed description of the
CPUID instruction, see Appendix C, "AMD Processor
Recognition" on page 505.

The presence of the CPUID instruction is indicated by the ID
bit (21) in the EFLAGS register. See "Testing for the CPUID
Instruction" on page 506 for more information.

If the processor supports the CPUID instruction, the
programmer must execute the standard function, EAX=0. The
CPUID function returns a 12-character string that identifies the
processor's vendor. For AMD processors, standard function 0
returns a vendor string of "AuthenticAMD". This string
requires the software to follow the AMD definitions for
subsequent CPUID functions and the values returned for those
functions.

The next step is for the programmer to determine if MMX
instructions are supported. Function 1 of the CPUID
instruction provides this information. Function 1 (EAX=1) of
the AMD CPUID instruction returns the feature bits in the EDX
register. If bit 23 in the EDX register is set to 1, MMX
instructions are supported. The following code sample shows
how to test for MMX instruction support.

mov eax, 1 ; setup function 1

CPUID ; call the function

test edx, 800000h ; test bit 23

jnz YES_MM ; multimedia technology supported

Alternatively, the extended function 1 (EAX=8000_0001h) can
be used to determine if MMX instructions are supported.

mov eax, 80000001h ; setup extended function 1

CPUID ; call the function

test edx, 800000h ; test bit 23

jnz YES_MM ; multimedia technology supported 
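
The vendor check that precedes feature detection (standard function 0) can be sketched in Python as well. The sketch assumes the usual x86 convention that the 12 characters are returned in EBX, EDX, and ECX, in that order, with the first character in the low byte of each register; this section does not spell out that layout, so treat it as an assumption. The function names are hypothetical.

```python
import struct


def vendor_string(ebx, edx, ecx):
    """Assemble the 12-character vendor string from CPUID function 0 registers.

    Register order EBX, EDX, ECX and little-endian byte order within each
    register are assumed from the usual x86 CPUID convention.
    """
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def is_amd(ebx, edx, ecx):
    """Return True when the vendor string is the AMD one."""
    return vendor_string(ebx, edx, ecx) == "AuthenticAMD"
```

Software would call `is_amd` with the three register values it captured after executing CPUID with EAX=0, and only then interpret later CPUID functions according to the AMD definitions.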



Task Switching 



Cooperative Multitasking



A task switch is an event that occurs within operating systems 
that allows multiple programs to be executed in parallel. Most 
modern operating systems utilizing task switching are called 
multitasking operating systems. 

There are two types of multitasking operating systems — 
cooperative and preemptive. 

In cooperative multitasking operating systems, applications do 
not care about other tasks that may be running. Each task 
assumes that it owns the machine state (processor, registers, I/O, 
memory, etc.). In addition, these tasks must take care of saving 
their own information (i.e., registers, stacks, states) in their own 
memory areas. The cooperative multitasking operating system 
does not save operating state information for the applications. 

There are different types of cooperative multitasking operating 
systems. Some of these operating systems perform some level of 
state saves, but this state saving is not always reliable. All 
software engineers programming for a cooperative multitasking 
environment must save the MMX or floating-point states before 
relinquishing control to another task or to the operating 
system. The FSAVE and FRSTOR commands are used to 
perform this task. Figure 113 on page 357 illustrates this task 
switching process. 

Note: Some cooperative operating systems may have API calls to 
perform these tasks for the application. 

Figure 113. Cooperative Task Switching

In preemptive multitasking operating systems like OS/2, 
Windows NT, and UNIX, the operating system handles all state 
and register saves. The application programmer does not need 
to save states when programming within a preemptive 
multitasking environment. The preemptive multitasking 
operating system sets aside a save area for each task. 

In a preemptive multitasking operating system, if a task switch 
occurs, the operating system sets the Control Register 0 (CR0)
Task Switch (TS) bit to 1. If the new task encounters a
floating-point or MMX instruction, an interrupt 7 (int 7, Device
Not Available) is generated. The int 7 handler saves the state of
the first task and restores the state of the second task. The int 7
handler sets CR0.TS to 0 and returns to the original
floating-point or MMX instruction in the second task. Figure 
114 illustrates this task switching process. 
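
The sequence described above can be walked through with a small Python model. It is a sketch under simplifying assumptions: a single value stands in for the whole MMX/FP register state, tasks are identified by name, and the class and method names are invented for this example. It is not operating system code.

```python
class LazyFpuSwitchModel:
    """Toy model of CR0.TS-driven saving of the shared MMX/FP register state."""

    def __init__(self):
        self.ts = 0                # models CR0.TS
        self.current_task = None   # task now running
        self.owner = None          # task whose state is in the registers
        self.registers = None      # stands in for the MMX/FP register contents
        self.save_area = {}        # per-task save areas kept by the OS

    def task_switch(self, new_task):
        """The operating system switches tasks and sets TS to 1."""
        self.current_task = new_task
        self.ts = 1

    def mmx_or_fp_instruction(self, new_value=None):
        """Run an MMX/FP instruction; the int 7 handler runs first if TS is set."""
        if self.ts:
            self._int7_handler()
        if new_value is not None:
            self.registers = new_value
        return self.registers

    def _int7_handler(self):
        """Save the previous owner's state, restore the new task's, clear TS."""
        if self.owner != self.current_task:
            if self.owner is not None:
                self.save_area[self.owner] = self.registers
            self.registers = self.save_area.get(self.current_task)
        self.owner = self.current_task
        self.ts = 0
```

In this model a task that never touches the MMX/FP registers never triggers a save or restore, which is the reason the state is saved from the int 7 handler rather than on every task switch.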

Figure 114. Preemptive Task Switching

MMX Exceptions 

There are no new exceptions defined for supporting the MMX 
and 3D instructions. All exceptions that occur while decoding 
or executing an MMX or 3D instruction are handled in existing 
exception handlers without modification. See "3D Exceptions" 
on page 93 for more information. 

Mixing MMX and Floating-Point Instructions

The programmer must take care when writing code that 
contains both MMX and floating-point instructions. The MMX 
code modules should be separated from the floating-point code
modules. All code of one type (MMX or floating-point code) 
should be grouped together as often as possible. To obtain the 
highest performance, routines should not contain any 
conditional branches at the end of loops that jump to code of a 
different type than the code that is currently being executed.

In certain multimedia environments, floating-point and MMX 
instructions may be mixed. For example, if a programmer wants 
to change the viewing perspective of a three-dimensional scene, 
the perspective can be changed through transformation 
matrices using floating-point registers. The picture/pixel 
information is integer-based and requires MMX instructions to 
manipulate this information. Both MMX and floating-point
instructions are required to perform this task. 

The software must clean up after itself at the end of an MMX 
code module. The EMMS instruction must be used at the end of 
an MMX code module to mark all floating-point registers as 
empty (11b = empty/invalid). In cooperative multitasking
operating systems, the EMMS instruction must be used when 
switching between tasks. 

Note: In some situations, experienced programmers can utilize the 
MMX registers to pass information between tasks. In these 
situations, the EMMS instruction is not required. 

The tag bits are affected by every MMX and floating-point 
instruction. After every MMX instruction except EMMS, all the
tag bits in the floating-point tag word are set to 0. When the 
EMMS instruction is executed, all the tag bits in the tag word 
are set to 1. For more information, see "Floating-Point and 
MMX/3D Instruction Compatibility" on page 256.

Prefixes 

All instructions in the x86 architecture translate to a binary 
value or opcode. This 1- or 2-byte opcode value is different for
each instruction. If an instruction is two bytes long, the second 
byte is called the Mod R/M byte. The Mod R/M byte is used to 
further describe the type of instruction that is used. 

The x86 opcode and the Mod R/M byte can also be followed by 
an SIB byte. This byte is used to describe the Scale, Index and 
Base forms of 32-bit addressing. 

The format of the x86 instruction allows for certain prefixes to 
be placed before each instruction. These prefixes indicate 
different types of command overrides. 

The MMX instructions follow these rules just like all the 
current existing instructions. This allows for an easy 
implementation into the x86 architecture. All of the rules that 
apply to the x86 architecture apply to MMX instructions, 
including accessing registers, memory, and I/O. 

Most opcode prefixes can be utilized while using MMX
instructions. The following prefixes can be used with MMX 
instructions: 

■ The Segment Override prefixes (2Eh/CS, 36h/SS, 3Eh/DS, 
26h/ES, 64h/FS, and 65h/GS) affect MMX instructions that
contain a memory operand. 

■ The LOCK prefix (F0h) triggers an invalid opcode exception
(interrupt 6). 

■ The Address Size Override prefix (67h) affects MMX 
instructions that contain a memory operand. 



MMX Instruction Set 



The following MMX instruction definitions are in alphabetical 
order according to the instruction mnemonics. 

EMMS

mnemonic    opcode    description
EMMS        0F 77h    Clear the MMX state

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X      X         X       The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X      X         X       Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Floating-point exception    X      X         X       An exception is pending due to the
pending (16)                                         floating-point execution unit.



The EMMS instruction is used to clear the MMX state following the execution of a 
block of code using MMX instructions. Because the MMX registers and tag words are 
shared with the floating-point unit, it is necessary to clear the state before executing 
code that includes floating-point instructions. 

MOVD

mnemonic                    opcode    description
MOVD mmreg1, reg32/mem32    0F 6Eh    Copy a 32-bit value from the general purpose register or
                                      memory location into the MMX register
MOVD reg32/mem32, mmreg1    0F 7Eh    Copy a 32-bit value from the MMX register into the general
                                      purpose register or memory location

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X      X         X       The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X      X         X       Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                         X       During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                      X       During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X      X                 One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                    X         X       A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X      X         X       An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)               X         X       An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The MOVD instruction moves a 32-bit data value from an MMX register to a general 
purpose register or memory, or it moves the 32-bit data from a general purpose 
register or memory into an MMX register. If the 32-bit data to be moved is provided by 
an MMX register, the instruction moves bits 31-0 of the MMX register into the 
specified register or memory location. If the 32-bit data is being moved into an MMX 
register, the instruction moves the 32-bits of data into bits 31-0 of the MMX register 
and fills bits 63-32 with zeros. 
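
The two directions of MOVD can be summarized with a short Python model that treats registers as plain integers. The function names are illustrative; the masking simply restates the bit ranges given in the description above.

```python
MASK32 = 0xFFFFFFFF


def movd_to_mmx(value32):
    """MOVD mmreg, reg32/mem32: low 32 bits copied, bits 63-32 zero-filled."""
    return value32 & MASK32


def movd_from_mmx(mmreg):
    """MOVD reg32/mem32, mmreg: bits 31-0 of the MMX register are copied out."""
    return mmreg & MASK32
```

Note that in the first direction the previous contents of the upper half of the MMX register do not survive, since the whole 64-bit register is rewritten.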



Related Instructions



See the MOVQ instruction. 

MOVQ

mnemonic                     opcode    description
MOVQ mmreg1, mmreg2/mem64    0F 6Fh    Copy a 64-bit value from an MMX register or memory location
                                       into an MMX register
MOVQ mmreg2/mem64, mmreg1    0F 7Fh    Copy a 64-bit value from an MMX register into an MMX register
                                       or memory location

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X      X         X       The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X      X         X       Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                         X       During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                      X       During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X      X                 One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                    X         X       A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X      X         X       An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)               X         X       An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The MOVQ instruction moves a 64-bit data value from one MMX register to another 
MMX register or memory, or it moves the 64-bit data from one MMX register or 
memory to another MMX register. Copying data from one memory location to another 
memory location cannot be accomplished with the MOVQ instruction. 



Related Instructions 



See the MOVD instruction. 

PACKSSDW

mnemonic                         opcode    description
PACKSSDW mmreg1, mmreg2/mem64    0F 6Bh    Pack with saturation signed 32-bit operands into signed
                                           16-bit results

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X      X         X       The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X      X         X       Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                         X       During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                      X       During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X      X                 One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                    X         X       A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X      X         X       An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)               X         X       An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PACKSSDW instruction performs a pack and saturate operation on two signed 
32-bit values in the first operand and two signed 32-bit values in the second operand. 
The four signed 16-bit results are placed in the specified MMX register. 

The pack operation is a data conversion. The PACKSSDW instruction converts or 
packs the four signed 32-bit values into four signed 16-bit values, applying saturating 
arithmetic. If the signed 32-bit value is less than -32768 (8000h), it saturates to -32768 
(8000h). If the signed 32-bit value is greater than 32767 (7FFFh), it saturates to 32767
(7FFFh). All values between -32768 and 32767 are represented with their signed 
16-bit value. 

The first operand must be an MMX register. In addition to providing the first operand, 
this MMX register is the location where the result of the pack and saturate operation 
is stored. The second operand can be an MMX register or a 64-bit memory location. 
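
The pack-and-saturate behavior described above can be sketched in Python, with 64-bit operands held as plain integers. The helper names are invented for this example, and the placement of the destination doublewords in the low half of the result follows the functional illustration for this instruction.

```python
def _saturate_signed(value, bits):
    """Clamp a signed integer to the range of a signed 'bits'-wide field."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def _signed_fields(q, bits):
    """Split a 64-bit value into signed fields of 'bits' width, lowest first."""
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    fields = [(q >> (bits * i)) & mask for i in range(64 // bits)]
    return [f - (1 << bits) if f & sign else f for f in fields]


def packssdw(dst, src):
    """Model of PACKSSDW: destination doublewords fill the low half of the result."""
    dwords = _signed_fields(dst, 32) + _signed_fields(src, 32)
    result = 0
    for i, d in enumerate(dwords):
        result |= (_saturate_signed(d, 16) & 0xFFFF) << (16 * i)
    return result
```

Using the operand values from the functional illustration, a destination of FFFF_8002_0000_01FCh and a source of 8000_0002_0000_8000h pack to 8000_7FFF_8002_01FCh, with the two upper words saturated.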

Functional Illustration of the PACKSSDW Instruction

                   bits 63-32     bits 31-0
mmreg2/mem64       8000_0002h     0000_8000h
mmreg1 (before)    FFFF_8002h     0000_01FCh

                   bits 63-48   bits 47-32   bits 31-16   bits 15-0
mmreg1 (after)     8000h ■      7FFFh ■      8002h        01FCh

■ Indicates a saturated value

The following list explains the functional illustration of the PACKSSDW instruction: 

■ Bits 63-32 of the source operand (mmreg2/mem64) are packed into bits 63-48 of 
the destination operand (mmregl). The result is saturated to the largest possible 
16-bit negative number because the 32-bit negative source operand (8000_0002h)
exceeds the capacity of the signed 16-bit destination operand. 

■ Bits 31-0 of the source operand are packed into bits 47-32 of the destination 
operand. The result is saturated to the largest possible 16-bit positive number 
because the 32-bit positive source operand (0000_8000h) exceeds the capacity of 
the 16-bit destination operand. 

■ Bits 63-32 of the destination operand are packed into bits 31-16 of the destination 
operand. The results are not saturated because the 32-bit negative source operand 
(FFFF_8002h) does not exceed the capacity of the 16-bit destination operand. 

■ Bits 31-0 of the destination operand are packed into bits 15-0 of the destination 
operand. The results are not saturated because the 32-bit positive source operand 
(0000_01FCh) does not exceed the capacity of the 16-bit destination operand.



Related Instructions



See the PACKSSWB instruction. 
See the PACKUSWB instruction.
See the PUNPCKHWD instruction. 
See the PUNPCKLWD instruction. 

PACKSSWB

mnemonic                         opcode    description
PACKSSWB mmreg1, mmreg2/mem64    0F 63h    Pack with saturation signed 16-bit operands into signed
                                           8-bit results

Privilege:             none
Registers Affected:    MMX
Flags Affected:        none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X      X         X       The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X      X         X       Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                         X       During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                      X       During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X      X                 One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                    X         X       A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X      X         X       An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)               X         X       An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PACKSSWB instruction performs a pack and saturate operation on four signed 
16-bit values in the first operand and four signed 16-bit values in the second operand. 
The eight signed 8-bit results are placed in the specified MMX register. 

The pack operation is a data conversion. The PACKSSWB instruction converts or 
packs the eight signed 16-bit values into eight signed 8-bit values, applying saturating 
arithmetic. If the signed 16-bit value is less than -128 (80h), it saturates to -128 (80h). 
If the signed 16-bit value is greater than 127 (7Fh), it saturates to 127 (7Fh). All values 
between -128 and 127 are represented by their signed 8-bit value. 

The first operand must be an MMX register. In addition to providing the first operand,
this MMX register is the location where the result of the pack and saturate operation 
is stored. The second operand can be an MMX register or a 64-bit memory location. 
Functional Illustration of the PACKSSWB Instruction

[Figure not reproduced: the four signed 16-bit words of mmreg2/mem64 (bits 63-48, 47-32, 31-16, and 15-0) and the four signed 16-bit words of mmreg1 are packed into the eight signed 8-bit fields of the result in mmreg1. In the figure, ■ indicates a saturated value.]

The following list explains the functional illustration of the PACKSSWB instruction: 

■ Bits 63-48 of the source operand (mmreg2/mem64) are packed into bits 63-56 of
the destination operand (mmreg1). The result is not saturated because the 16-bit
positive source operand (007Eh) does not exceed the capacity of a signed 8-bit
destination operand.

■ Bits 47-32 of the source operand are packed into bits 55-48 of the destination 
operand. The result is saturated to the largest possible 8-bit positive number 
because the 16-bit positive source operand (7F00h) exceeds the capacity of a 
signed 8-bit destination operand. 

■ Bits 31-16 of the source operand are packed into bits 47-40 of the destination 
operand. The result is saturated to the largest possible 8-bit negative number 
because the 16-bit negative source operand (EF9Dh) exceeds the capacity of a 
signed 8-bit destination operand. 

■ Bits 15-0 of the source operand are packed into bits 39-32 of the destination 
operand. The result is not saturated because the 16-bit negative source operand 
(FF88h) does not exceed the capacity of the 8-bit destination operand. 

■ Bits 63-48 of the destination operand are packed into bits 31-24 of the destination 
operand. The result is saturated to the largest possible 8-bit negative number 
because the 16-bit negative source operand (FF02h) exceeds the capacity of a 
signed 8-bit destination operand. 
■ Bits 47-32 of the destination operand are packed into bits 23-16 of the destination
operand. The result is saturated to the largest possible 8-bit positive number 
because the 16-bit positive source operand (0085h) exceeds the capacity of a 
signed 8-bit destination operand. 

■ Bits 31-16 of the destination operand are packed into bits 15-8 of the destination 
operand. The result is not saturated because the 16-bit positive source operand 
(007Eh) does not exceed the capacity of a signed 8-bit destination operand.

■ Bits 15-0 of the destination operand are packed into bits 7-0 of the destination 
operand. The result is saturated to the largest possible 8-bit negative number 
because the 16-bit negative source operand (81CFh) exceeds the capacity of a 
signed 8-bit destination operand. 
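
The pack-and-saturate behavior listed above can be modeled in a few lines of Python. This is only an illustrative sketch, not AMD code: the helper names `sat_s8` and `packsswb` are invented for this example, the MMX registers are modeled as plain 64-bit integers, and the operand values are the ones quoted in the list.

```python
def sat_s8(word: int) -> int:
    """Clamp a signed 16-bit value (passed as a raw 0..FFFFh word) to a signed byte."""
    value = word - 0x10000 if word & 0x8000 else word   # reinterpret as signed
    value = max(-128, min(127, value))                   # signed saturation
    return value & 0xFF                                  # back to a raw byte


def packsswb(dest: int, src: int) -> int:
    """Model PACKSSWB on two 64-bit integers and return the new destination."""
    # Destination words fill result bytes 0-3, source words fill bytes 4-7.
    words = [(dest >> (16 * i)) & 0xFFFF for i in range(4)]
    words += [(src >> (16 * i)) & 0xFFFF for i in range(4)]
    result = 0
    for i, word in enumerate(words):
        result |= sat_s8(word) << (8 * i)
    return result


# Operand values quoted in the list above.
src = 0x007E_7F00_EF9D_FF88
dest = 0xFF02_0085_007E_81CF
print(f"{packsswb(dest, src):016X}")  # prints 7E7F8088807F7E80
```

Reading the printed result from left to right gives the eight destination bytes in the same order as the eight list items.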

Related Instructions

See the PACKSSDW instruction.
See the PACKUSWB instruction.
See the PUNPCKHBW instruction.
See the PUNPCKLBW instruction.
PACKUSWB

mnemonic                        opcode   description
PACKUSWB mmreg1, mmreg2/mem64   0F 67h   Pack with saturation signed 16-bit operands into unsigned 8-bit results

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 

Exceptions Generated: 



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PACKUSWB instruction performs a pack and saturate operation on four signed 
16-bit values in the first operand and four signed 16-bit values in the second operand. 
The eight unsigned 8-bit results are placed in the specified MMX register. 

The pack operation is a data conversion. The PACKUSWB instruction converts or 
packs the eight signed 16-bit values into eight unsigned 8-bit values, applying 
saturating arithmetic. If the signed 16-bit value is a negative number, it saturates to 0
(00h). If the signed 16-bit value is greater than 255 (FFh), it saturates to 255 (FFh). All
values between 0 and 255 are represented with their unsigned 8-bit value. 

The first operand must be an MMX register. In addition to providing the first operand, 
this MMX register is the location where the result of the pack and saturate operation 
is stored. The second operand can be an MMX register or a 64-bit memory location. 
[Figure not reproduced: functional illustration of the PACKUSWB instruction. The eight unsigned 8-bit results occupy bits 63-56, 55-48, 47-40, 39-32, 31-24, 23-16, 15-8, and 7-0 of mmreg1. In the figure, ■ indicates a saturated value.]


The following list explains the functional illustration of the PACKUSWB instruction: 

■ Bits 63-48 of the source operand (mmreg2/mem64) are packed into bits 63-56 of
the destination operand (mmreg1). The result is saturated to the largest possible
8-bit positive number because the 16-bit positive source operand (0112h) exceeds
the capacity of an unsigned 8-bit destination operand.

■ Bits 47-32 of the source operand are packed into bits 55-48 of the destination
operand. The result is not saturated because the 16-bit positive source operand
(008Bh) does not exceed the capacity of an unsigned 8-bit destination operand.

■ Bits 31-16 of the source operand are packed into bits 47-40 of the destination
operand. The result is saturated to the largest possible 8-bit positive number
because the 16-bit positive source operand exceeds the capacity of an unsigned
8-bit destination operand.

■ Bits 15-0 of the source operand are packed into bits 39-32 of the destination
operand. The result is saturated to 00h because the source operand (FF88h) is a
negative value.

■ Bits 63-48 of the destination operand are packed into bits 31-24 of the destination
operand (mmreg1). The result is not saturated because the 16-bit positive source
operand (0002h) does not exceed the capacity of an unsigned 8-bit destination
operand.

■ Bits 47-32 of the destination operand are packed into bits 23-16 of the destination
operand. The result is saturated to the largest possible 8-bit positive number
because the 16-bit positive source operand (023Ah) exceeds the capacity of an
unsigned 8-bit destination operand.

■ Bits 31-16 of the destination operand are packed into bits 15-8 of the destination
operand. The result is not saturated because the 16-bit positive source operand
(007Eh) does not exceed the capacity of an unsigned 8-bit destination operand.

■ Bits 15-0 of the destination operand are packed into bits 7-0 of the destination
operand. The result is saturated to 00h because the source operand (FFF8h) is a
negative value.
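
As a rough cross-check of the list above, the unsigned pack can be sketched in Python with the standard `struct` module. The function name `packuswb` is invented for this example and the registers are modeled as 64-bit integers. The list does not give a value for the third source word, so 0100h is used here purely as a stand-in that saturates.

```python
import struct


def packuswb(dest: int, src: int) -> int:
    """Model PACKUSWB on two 64-bit integers and return the new destination."""
    # "<4h" decodes four signed 16-bit words, lowest-order word first.
    low = struct.unpack("<4h", dest.to_bytes(8, "little"))   # -> result bytes 0-3
    high = struct.unpack("<4h", src.to_bytes(8, "little"))   # -> result bytes 4-7
    # Negative words clamp to 0 and words above 255 clamp to 255.
    packed = bytes(max(0, min(255, word)) for word in low + high)
    return int.from_bytes(packed, "little")


# Destination words come from the list above. The third source word (0100h)
# is an assumed placeholder, because the list does not state that value.
dest = 0x0002_023A_007E_FFF8
src = 0x0112_008B_0100_FF88
print(f"{packuswb(dest, src):016X}")  # prints FF8BFF0002FF7E00
```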



Related Instructions

See the PACKSSDW instruction.
See the PACKSSWB instruction.
See the PUNPCKHBW instruction.
See the PUNPCKLBW instruction.
PADDB

mnemonic                     opcode   description
PADDB mmreg1, mmreg2/mem64   0F FCh   Add unsigned packed 8-bit values

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 
Exceptions Generated: 



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDB instruction adds eight unsigned 8-bit values from the source operand (an 
MMX register or a 64-bit memory location) to the eight corresponding unsigned 8-bit 
values in the destination operand (an MMX register). If any of the eight results is 
greater than the capacity of its 8-bit destination, the value wraps around with no carry 
into the next location. The eight 8-bit results are stored in the MMX register that is 
specified as the destination operand. 
Functional Illustration of the PADDB Instruction

[Figure not reproduced: the eight 8-bit values of mmreg2/mem64 are added to the eight corresponding 8-bit values of mmreg1, and the eight 8-bit sums are written to mmreg1.]



The following list explains the functional illustration of the PADDB instruction: 

■ The value 53h is added to ECh and wraps around to 3Fh. 

■ The value FCh is added to 14h and wraps around to 10h.

■ The remaining addition operations are simple unsigned operations with no 
wraparound. 
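
The wraparound rule can be illustrated with a minimal Python sketch. The function name `paddb` is invented for this example and the registers are modeled as 64-bit integers. Only the two wrapping byte pairs from the list are used, and placing them in the two lowest lanes is an assumption.

```python
def paddb(dest: int, src: int) -> int:
    """Model PADDB: eight independent 8-bit additions with wraparound."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (dest >> shift) & 0xFF
        b = (src >> shift) & 0xFF
        # Masking to 8 bits discards the carry, so it never reaches the next lane.
        result |= ((a + b) & 0xFF) << shift
    return result


# The two wrapping pairs from the list above; all other lanes are zero.
print(f"{paddb(0x53FC, 0xEC14):04X}")  # prints 3F10
```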

Related Instructions

See the PADDD instruction.
See the PADDW instruction.
See the PADDSB instruction.
See the PADDSW instruction.
See the PADDUSB instruction.
See the PADDUSW instruction.
PADDD

mnemonic                     opcode   description
PADDD mmreg1, mmreg2/mem64   0F FEh   Add unsigned packed 32-bit values

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDD instruction adds two unsigned 32-bit values from the source operand (an
MMX register or a 64-bit memory location) to the two corresponding unsigned 32-bit 
values in the destination operand (an MMX register). If any of the two results is 
greater than the capacity of its 32-bit destination, the value wraps around with no 
carry into the next location. The two 32-bit results are stored in the MMX register 
specified as the destination operand. 
Functional Illustration of the PADDD Instruction

[Figure not reproduced: the two 32-bit values of mmreg2/mem64 are added to the two corresponding 32-bit values of mmreg1, and the two 32-bit sums are written to mmreg1.]

The following list explains the functional illustration of the PADDD instruction:

■ The value FFF0_5C43h is added to 000F_A3BEh and wraps around to 0000_0001h.

■ The second addition is a simple unsigned add operation with no wraparound. 
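
A minimal Python sketch of the two-lane addition follows. The name `paddd` is invented for this example and the registers are modeled as 64-bit integers. The wrapping pair is taken from the list above and placed in the upper lane as an assumption. The lower-lane values 1 and 2 are made up, because the list does not state the operands of the second addition.

```python
MASK32 = 0xFFFF_FFFF


def paddd(dest: int, src: int) -> int:
    """Model PADDD: two independent 32-bit additions with wraparound."""
    # Each lane is summed separately and truncated to 32 bits, so a carry
    # out of the lower lane never reaches the upper lane.
    low = ((dest & MASK32) + (src & MASK32)) & MASK32
    high = (((dest >> 32) & MASK32) + ((src >> 32) & MASK32)) & MASK32
    return (high << 32) | low


# Upper lane: the wrapping pair from the list. Lower lane: assumed values.
print(f"{paddd(0xFFF0_5C43_0000_0001, 0x000F_A3BE_0000_0002):016X}")
# prints 0000000100000003
```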



Related Instructions 



See the PADDB instruction. 
See the PADDW instruction. 
See the PADDSB instruction. 
See the PADDSW instruction. 
PADDSB

mnemonic                      opcode   description
PADDSB mmreg1, mmreg2/mem64   0F ECh   Add signed packed 8-bit values and saturate

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDSB instruction adds eight signed 8-bit values from the source operand (an 
MMX register or a 64-bit memory location) to the eight corresponding signed 8-bit 
values in the destination operand (an MMX register). If the sum of any two 8-bit values 
is less than -128 (80h), it saturates to -128 (80h). If the sum of any two 8-bit values is 
greater than 127 (7Fh), it saturates to 127 (7Fh). The eight signed 8-bit results are 
stored in the MMX register specified as the destination operand. 







Functional Illustration of the PADDSB Instruction

[Figure not reproduced: the eight signed 8-bit values of mmreg2/mem64 are added to the eight corresponding signed 8-bit values of mmreg1, and the eight signed 8-bit sums are written to mmreg1. In the figure, ■ indicates a saturated value.]

The following list explains the functional illustration of the PADDSB instruction:

■ The signed 8-bit positive value 00h is added to the signed 8-bit positive value 01h
with a signed 8-bit positive result of 01h.

■ The signed 8-bit negative value D2h (-46) is added to the signed 8-bit negative
value 88h (-120) and saturates to 80h (-128), the largest possible signed 8-bit
negative value.

■ The signed 8-bit positive value 53h (+83) is added to the signed 8-bit negative value
ECh (-20) with a signed 8-bit positive result of 3Fh (+63).

■ The signed 8-bit positive value 42h is added to the signed 8-bit positive value 00h
with a signed 8-bit positive result of 42h.

■ The signed 8-bit positive value 77h (+119) is added to the signed 8-bit positive
value 14h (+20) and saturates to 7Fh (+127), the largest possible positive value.

■ The signed 8-bit positive value 70h (+112) is added to the signed 8-bit positive
value 44h (+68) and saturates to 7Fh (+127), the largest possible positive value.

■ The signed 8-bit positive value 07h (+7) is added to the signed 8-bit negative value
F7h (-9) with a signed 8-bit negative result of FEh (-2).

■ The signed 8-bit negative value 9Ah (-102) is added to the signed 8-bit negative
value A8h (-88) and saturates to 80h (-128), the largest possible signed 8-bit
negative value.
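
The eight additions in the list can be replayed one lane at a time in Python. This is a sketch only: `to_signed8` and `paddsb_lane` are names invented for this example, and the lanes are handled as separate byte pairs rather than as a packed 64-bit register.

```python
def to_signed8(byte: int) -> int:
    """Reinterpret a raw byte (0..FFh) as a signed 8-bit value."""
    return byte - 0x100 if byte & 0x80 else byte


def paddsb_lane(a: int, b: int) -> int:
    """Add two signed bytes, clamp to -128..127, and return the raw byte."""
    total = to_signed8(a) + to_signed8(b)
    return max(-128, min(127, total)) & 0xFF


# The eight operand pairs, in the order they appear in the list above.
pairs = [(0x00, 0x01), (0xD2, 0x88), (0x53, 0xEC), (0x42, 0x00),
         (0x77, 0x14), (0x70, 0x44), (0x07, 0xF7), (0x9A, 0xA8)]
results = [paddsb_lane(a, b) for a, b in pairs]
print(" ".join(f"{r:02X}" for r in results))  # prints 01 80 3F 42 7F 7F FE 80
```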

Related Instructions

See the PADDB instruction.
See the PADDD instruction.
See the PADDW instruction.
See the PADDSW instruction.
PADDSW

mnemonic                      opcode   description
PADDSW mmreg1, mmreg2/mem64   0F EDh   Add signed packed 16-bit values and saturate

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDSW instruction adds four signed 16-bit values from the source operand (an 
MMX register or a 64-bit memory location) to the four corresponding signed 16-bit 
values in the destination operand (an MMX register). If the sum of any two 16-bit 
values is less than -32768 (8000h), it saturates to -32768 (8000h). If the sum of any two 
16-bit values is greater than 32767 (7FFFh), it saturates to 32767 (7FFFh). The four 
signed 16-bit results are stored in the MMX register specified as the destination 
operand. 
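Functional Illustration of the PADDSW Instruction

[Figure not reproduced: the four signed 16-bit values of mmreg2/mem64 are added to the four corresponding signed 16-bit values of mmreg1, and the four signed 16-bit sums are written to mmreg1. In the figure, ■ indicates a saturated value.]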



The following list explains the functional illustration of the PADDSW instruction: 

■ The signed 16-bit negative value D250h (-11696) is added to the signed 16-bit 
negative value 8807h (-30713) and saturates to 8000h (-32768), the largest possible 
signed 16-bit negative value. 

■ The signed 16-bit positive value 5321h (+21281) is added to the signed 16-bit 
negative value EC22h (-5086) with a signed 16-bit positive result of 3F43h 
(+16195). 

■ The signed 16-bit positive value 7007h (+28679) is added to the signed 16-bit 
positive value 0FF9h (+4089) and saturates to 7FFFh (+32767), the largest possible 
positive value. 

■ The signed 16-bit negative value FFFFh (-1) is added to the signed 16-bit negative 
value FFFFh (-1) with the negative 16-bit result of FFFEh (-2). 
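
The four additions above can be checked with a short Python sketch that uses the standard `struct` module to split a 64-bit value into signed words. The name `paddsw` is invented for this example. Placing the first list item in the highest-order word is an assumption about the figure's layout.

```python
import struct


def paddsw(dest: int, src: int) -> int:
    """Model PADDSW: four signed 16-bit additions with signed saturation."""
    # "<4h" decodes four signed 16-bit words, lowest-order word first.
    a = struct.unpack("<4h", dest.to_bytes(8, "little"))
    b = struct.unpack("<4h", src.to_bytes(8, "little"))
    sums = [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]
    return int.from_bytes(struct.pack("<4h", *sums), "little")


# Word values quoted in the list above, first item in the highest word.
dest = 0xD250_5321_7007_FFFF
src = 0x8807_EC22_0FF9_FFFF
print(f"{paddsw(dest, src):016X}")  # prints 80003F437FFFFFFE
```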

Related Instructions

See the PADDB instruction.
See the PADDD instruction.
See the PADDW instruction.
See the PADDSB instruction.
See the PADDUSB instruction.
See the PADDUSW instruction.
PADDUSB

mnemonic                       opcode   description
PADDUSB mmreg1, mmreg2/mem64   0F DCh   Add unsigned packed 8-bit values and saturate

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDUSB instruction adds eight unsigned 8-bit values from the source operand 
(an MMX register or a 64-bit memory location) to the eight corresponding unsigned
8-bit values in the destination operand (an MMX register). The eight unsigned 8-bit 
results are stored in the MMX register specified as the destination operand. 

If the sum of any two unsigned 8-bit values is greater than 255 (FFh), it saturates to 
255 (FFh). 
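Functional Illustration of the PADDUSB Instruction

[Figure not reproduced: the eight unsigned 8-bit values of mmreg2/mem64 are added to the eight corresponding unsigned 8-bit values of mmreg1, and the eight unsigned 8-bit sums are written to mmreg1. In the figure, ■ indicates a saturated value.]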

The following list explains the functional illustration of the PADDUSB instruction: 

■ The sum of 7Fh and 81h is 100h. This value is greater than FFh, so the result
saturates to FFh.

■ The sum of D2h and 88h is 15Ah. This value is greater than FFh, so the result
saturates to FFh.

■ The sum of 53h and ECh is 13Fh. This value is greater than FFh, so the result
saturates to FFh.

■ The sum of 42h and 0Eh is 50h. This value is not greater than FFh, so the result
does not saturate.

■ The sum of 77h and 14h is 8Bh. This value is not greater than FFh, so the result
does not saturate.

■ The sum of 70h and 44h is B4h. This value is not greater than FFh, so the result
does not saturate.

■ The sum of 07h and F7h is FEh. This value is not greater than FFh, so the result
does not saturate.

■ The sum of 9Ah and A8h is 142h. This value is greater than FFh, so the result
saturates to FFh.
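
The eight sums above can be reproduced with a very small Python sketch that treats each operand as an 8-byte string. The name `paddusb` is invented for this example, and the byte order simply follows the order of the list items rather than any particular register layout.

```python
def paddusb(dest: bytes, src: bytes) -> bytes:
    """Model PADDUSB on two 8-byte operands, one lane per byte."""
    # Any sum above FFh is clamped to FFh instead of wrapping around.
    return bytes(min(a + b, 0xFF) for a, b in zip(dest, src))


# Operand bytes in the order they appear in the list above.
a = bytes.fromhex("7F D2 53 42 77 70 07 9A")
b = bytes.fromhex("81 88 EC 0E 14 44 F7 A8")
print(paddusb(a, b).hex(" ").upper())  # prints FF FF FF 50 8B B4 FE FF
```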

Related Instructions

See the PADDB instruction.
See the PADDD instruction.
See the PADDW instruction.
See the PADDSB instruction.
See the PADDSW instruction.
See the PADDUSW instruction.
PADDUSW

mnemonic                       opcode   description
PADDUSW mmreg1, mmreg2/mem64   0F DDh   Add unsigned packed 16-bit values and saturate

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDUSW instruction adds four unsigned 16-bit values from the source operand 
(an MMX register or a 64-bit memory location) to the four corresponding unsigned 
16-bit values in the destination operand (an MMX register). The four unsigned 16-bit 
results are stored in the MMX register specified as the destination operand. 

If the sum of any two unsigned 16-bit values is greater than 65,535 (FFFFh), it 
saturates to 65,535 (FFFFh).
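Functional Illustration of the PADDUSW Instruction

[Figure not reproduced: the four unsigned 16-bit values of mmreg2/mem64 are added to the four corresponding unsigned 16-bit values of mmreg1, and the four unsigned 16-bit sums are written to mmreg1. In the figure, ■ indicates a saturated value.]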

The following list explains the functional illustration of the PADDUSW instruction: 

■ The sum of 7E10h and 7000h is EE10h. This value is not greater than FFFFh, so the
result does not saturate.

■ The sum of 8000h and 8000h is 10000h. This value is greater than FFFFh, so the
result saturates to FFFFh.

■ The sum of FFFEh and 0015h is 10013h. This value is greater than FFFFh, so the
result saturates to FFFFh.

■ The sum of 1234h and 4567h is 579Bh. This value is not greater than FFFFh, so the 
result does not saturate. 
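
One way to picture the saturation test is as a check for a carry out of bit 15 of each lane, as in the Python sketch below. The name `paddusw` is invented for this example and the registers are modeled as 64-bit integers. Placing the first list item in the highest-order word is an assumption.

```python
def paddusw(dest: int, src: int) -> int:
    """Model PADDUSW: four unsigned 16-bit additions with saturation."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        total = ((dest >> shift) & 0xFFFF) + ((src >> shift) & 0xFFFF)
        if total >> 16:        # carry out of bit 15: the sum exceeded FFFFh
            total = 0xFFFF
        result |= total << shift
    return result


# Word values quoted in the list above, first item in the highest word.
dest = 0x7E10_8000_FFFE_1234
src = 0x7000_8000_0015_4567
print(f"{paddusw(dest, src):016X}")  # prints EE10FFFFFFFF579B
```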

Related Instructions

See the PADDB instruction.
See the PADDD instruction.
See the PADDW instruction.
See the PADDSB instruction.
See the PADDSW instruction.
See the PADDUSB instruction.
PADDW

mnemonic                     opcode   description
PADDW mmreg1, mmreg2/mem64   0F FDh   Add unsigned packed 16-bit values

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 
Exceptions Generated: 



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PADDW instruction adds four unsigned 16-bit values from the source operand (an 
MMX register or a 64-bit memory location) to the four corresponding unsigned 16-bit 
values in the destination operand (an MMX register). If any of the four results is 
greater than the capacity of its 16-bit destination, the value wraps around with no 
carry into the next location. The four 16-bit results are stored in the MMX register 
specified as the destination operand. 
Functional Illustration of the PADDW Instruction

                63                                    0
mmreg2/mem64      8000h     FF00h     00FCh     FFFFh
                    +         +         +         +
mmreg1            0123h     01ECh     8014h     FFFFh
                    =         =         =         =
mmreg1            8123h     00ECh     8110h     FFFEh


The following list explains the functional illustration of the PADDW instruction: 

■ The value 8000h is added to 0123h with a normal unsigned result of 8123h. 

■ The value FF00h is added to 01ECh and wraps around to 00ECh.

■ The value 00FCh is added to 8014h with a normal unsigned result of 8110h.

■ The value FFFFh is added to FFFFh and wraps around to FFFEh. 
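
The four lane additions can also be done with a single wide addition by masking off the top bit of every lane first, a common software trick for packed arithmetic. The Python sketch below uses that trick only to show that lanes stay isolated. It is not a description of how the processor implements PADDW, and the names `paddw`, `HIGH_BITS`, and `LOW_BITS` are invented for this example.

```python
HIGH_BITS = 0x8000_8000_8000_8000   # bit 15 of each 16-bit lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF    # bits 14-0 of each lane


def paddw(dest: int, src: int) -> int:
    """Model PADDW with one wide addition instead of a per-lane loop."""
    # Adding only bits 14-0 cannot carry past bit 15, so lanes stay isolated.
    partial = (dest & LOW_BITS) + (src & LOW_BITS)
    # Bit 15 of each lane is the XOR of both top bits and the carry already
    # sitting in bit 15 of the partial sum; the carry out of bit 15 is dropped.
    return partial ^ ((dest ^ src) & HIGH_BITS)


# Operand words from the list above, highest-order word first.
print(f"{paddw(0x0123_01EC_8014_FFFF, 0x8000_FF00_00FC_FFFF):016X}")
# prints 812300EC8110FFFE
```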



Related Instructions 



See the PADDB instruction. 
See the PADDD instruction. 
See the PADDSB instruction. 
See the PADDSW instruction. 
See the PADDUSB instruction. 
See the PADDUSW instruction. 
PAND

mnemonic                    opcode   description
PAND mmreg1, mmreg2/mem64   0F DBh   AND 64-bit values

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated: 



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PAND instruction operates on the 64-bit source and destination operands to 
complete a bitwise logical AND. The results are stored in the destination operand. If 
the corresponding bits in the source and destination operands both equal 1, the 
resulting bit is 1 in the destination. If either bit in the source or destination operands 
equals 0, the resulting bit is 0 in the destination. 

The PAND instruction can be used to extract operands from packed fields based on
the masks that are produced by the compare instructions, PCMPEQ and PCMPGT.
This technique can eliminate branch prediction overhead in MMX routines.
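
The mask technique can be sketched in Python as follows. This is an illustrative model only: `pcmpgtw` stands in for the word form of the PCMPGT compare, `pand` models the AND, registers are plain 64-bit integers, and the sample values are made up. The example keeps every word that is greater than zero and clears the rest, with no per-lane branch in the selection step.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def signed16(word: int) -> int:
    """Reinterpret a raw 16-bit word as a signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pcmpgtw(dest: int, src: int) -> int:
    """Per-lane signed compare: FFFFh where dest > src, otherwise 0000h."""
    mask = 0
    for lane in range(4):
        shift = 16 * lane
        if signed16((dest >> shift) & 0xFFFF) > signed16((src >> shift) & 0xFFFF):
            mask |= 0xFFFF << shift
    return mask


def pand(dest: int, src: int) -> int:
    """Model PAND: bitwise AND of two 64-bit values."""
    return dest & src & MASK64


values = 0x0005_FFF0_0100_0000        # lanes hold 5, -16, 256, 0 (made-up data)
mask = pcmpgtw(values, 0)             # all ones in lanes where the value > 0
print(f"{pand(values, mask):016X}")   # prints 0005000001000000
```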
Functional Illustration of the PAND Instruction

               63               48 47               32 31               16 15                0
mmreg1         1010_1111_0000_1101 0000_1111_0000_1111 1100_0001_0011_0001 1000_1100_1101_0011
                   Logical AND         Logical AND         Logical AND         Logical AND
mmreg2/mem64   0101_1100_1100_0011 1100_1101_0100_1110 1011_0001_0011_1001 0110_0011_0101_1001

Result         63               48 47               32 31               16 15                0
mmreg1         0000_1100_0000_0001 0000_1101_0000_1110 1000_0001_0011_0001 0000_0000_0101_0001


Related Instructions

See the PANDN instruction.
See the POR instruction.
See the PXOR instruction.
PANDN

mnemonic                     opcode   description
PANDN mmreg1, mmreg2/mem64   0F DFh   Invert a 64-bit value, then AND the inverted value and a 64-bit value in memory or an MMX register

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:



Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PANDN instruction first operates on the 64-bit destination operand (an MMX 
register) to complete a bitwise logical NOT, inverting each bit. This operation changes 
1 bits to 0 bits and 0 bits to 1 bits, storing the results in the destination operand. The 
inverted 64-bit destination operand is then logically AND'd with the 64-bit source 
operand (an MMX register or a 64-bit memory operand) to complete the PANDN 
operation. 

If corresponding bits in the source operand and the inverted destination operand are
both 1, the resulting bit is 1 in the destination. If either bit in the source operand or
the inverted destination operand is 0, the resulting bit is 0 in the destination.

The PANDN instruction can be used to extract alternate operands from packed fields
based on the inverse of the masks that are produced by the compare instructions
PCMPEQ and PCMPGT. This technique can eliminate branch prediction overhead in
MMX routines.
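The mask-and-select technique mentioned above can be sketched as follows. This is a minimal software model, not code from the book: the helper names (`pcmpgtw`, `pand`, `pandn`, `por`, `packed_max_words`) are hypothetical Python stand-ins for the corresponding MMX instructions, and the operand values are made up for illustration. The sketch picks the larger of each pair of signed 16-bit words without any conditional jump in the selection step.

```python
MASK64 = (1 << 64) - 1


def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pcmpgtw(dst: int, src: int) -> int:
    """Model PCMPGTW: FFFFh in each word where signed dst > signed src."""
    mask = 0
    for shift in range(0, 64, 16):
        a = to_signed16((dst >> shift) & 0xFFFF)
        b = to_signed16((src >> shift) & 0xFFFF)
        if a > b:
            mask |= 0xFFFF << shift
    return mask


def pand(dst: int, src: int) -> int:
    """Model PAND: dst AND src."""
    return dst & src


def pandn(dst: int, src: int) -> int:
    """Model PANDN: (NOT dst) AND src."""
    return (~dst & MASK64) & src


def por(dst: int, src: int) -> int:
    """Model POR: dst OR src."""
    return dst | src


def packed_max_words(a: int, b: int) -> int:
    """Signed maximum of four packed words, selected with masks only."""
    mask = pcmpgtw(a, b)     # FFFFh in each word where a > b
    from_a = pand(mask, a)   # keep a where the mask is set
    from_b = pandn(mask, b)  # keep b where the mask is clear
    return por(from_a, from_b)


a = 0x0001_8000_FFFF_1234
b = 0xDA14_8000_0001_1243
print(f"{packed_max_words(a, b):016X}")  # prints 0001800000011243
```

PAND keeps the fields chosen by the mask, PANDN keeps the fields chosen by its inverse, and OR-ing the two halves merges them. The same pattern works for any per-element condition that a compare instruction can express.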
Functional Illustration of the PANDN Instruction

Bits     mmreg1                mmreg2/mem64          Result (mmreg1)
63-48    1010_1111_0000_1101   0101_1100_1100_0011   0101_0000_1100_0010
47-32    0000_1111_0000_1111   1100_1101_0100_1110   1100_0000_0100_0000
31-16    1100_0001_0011_0001   1011_0001_0011_1001   0011_0000_0000_1000
15-0     1000_1100_1101_0011   0110_0011_0101_1001   0110_0011_0000_1000

Each field of mmreg1 is inverted and then logically ANDed with the corresponding field
of mmreg2/mem64.



Related Instructions See the PAND instruction.

See the POR instruction.
See the PXOR instruction.
PCMPEQB

mnemonic                        opcode     description
PCMPEQB mmreg1, mmreg2/mem64    0F 74h     Compare packed 8-bit values for equality

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPEQB instruction operates on 8-bit data values. The instruction compares
each 8-bit value in the destination operand with the corresponding 8-bit value in the
source operand to determine if they are equal.

If two corresponding 8-bit values are equal, all 8 bits of that element in the
destination operand are set to 1. If any of their corresponding bits differ, all 8 bits
of that element in the destination operand are set to 0.
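The element-wise equality compare can be modeled in Python as shown below. This is a sketch under stated assumptions: `pcmpeq` is a hypothetical helper that takes the element width as a parameter so that one function covers PCMPEQB (8), PCMPEQW (16), and PCMPEQD (32), and the operand values are made up for illustration.

```python
def pcmpeq(dst: int, src: int, width: int) -> int:
    """Model PCMPEQB/W/D: all-ones element where dst and src elements match."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16, or 32")
    element_mask = (1 << width) - 1
    result = 0
    for shift in range(0, 64, width):
        a = (dst >> shift) & element_mask
        b = (src >> shift) & element_mask
        if a == b:
            result |= element_mask << shift  # true: every bit set to 1
    return result


# Illustrative operands: bytes 6, 4, 3, and 1 match; the others differ.
dst = 0x1122_33FF_80EE_A114
src = 0x1022_43FF_80CE_A104
print(f"{pcmpeq(dst, src, 8):016X}")  # prints 00FF00FFFF00FF00
```

The result is a mask rather than a flag: each element is either all ones or all zeros, which is what makes it directly usable as an operand to PAND and PANDN.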
Functional Illustration of the PCMPEQB Instruction

Bits     mmreg2/mem64   mmreg1        Result (mmreg1)   Equal?
63-56    DBh            DDh           00h               False
55-48    15h            15h           FFh               True
47-40    43h            [illegible]   00h               False
39-32    FFh            FFh           FFh               True
31-24    80h            80h           FFh               True
23-16    CEh            EEh           00h               False
15-8     A1h            A1h           FFh               True
7-0      04h            14h           00h               False



Related Instructions



See the PCMPEQD instruction. 
See the PCMPEQW instruction. 
See the PCMPGTB instruction. 
See the PCMPGTD instruction. 
See the PCMPGTW instruction. 
PCMPEQD

mnemonic                        opcode     description
PCMPEQD mmreg1, mmreg2/mem64    0F 76h     Compare packed 32-bit values for equality

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPEQD instruction operates on 32-bit data values. The instruction compares
each 32-bit value in the destination operand with the corresponding 32-bit value in the
source operand to determine if they are equal.

If two corresponding 32-bit values are equal, all 32 bits of that element in the
destination operand are set to 1. If any of their corresponding bits differ, all 32 bits
of that element in the destination operand are set to 0.
Functional Illustration of the PCMPEQD Instruction

Bits     mmreg2/mem64   mmreg1       Result (mmreg1)   Equal?
63-32    00C0BA14h      0000BA13h    00000000h         False
31-0     EF031243h      EF031243h    FFFFFFFFh         True



Related Instructions See the PCMPEQB instruction. 

See the PCMPEQW instruction. 
See the PCMPGTB instruction. 
See the PCMPGTD instruction. 
See the PCMPGTW instruction. 
PCMPEQW

mnemonic                        opcode     description
PCMPEQW mmreg1, mmreg2/mem64    0F 75h     Compare packed 16-bit values for equality

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPEQW instruction operates on 16-bit data values. The instruction compares
each 16-bit value in the destination operand with the corresponding 16-bit value in the
source operand to determine if they are equal.

If two corresponding 16-bit values are equal, all 16 bits of that element in the
destination operand are set to 1. If any of their corresponding bits differ, all 16 bits
of that element in the destination operand are set to 0.
Functional Illustration of the PCMPEQW Instruction

Bits     mmreg2/mem64   mmreg1   Result (mmreg1)   Equal?
63-48    [illegible]    DA24h    0000h             False
47-32    8000h          8000h    FFFFh             True
31-16    1243h          1243h    FFFFh             True
15-0     1234h          1243h    0000h             False



Related Instructions



See the PCMPEQB instruction. 
See the PCMPEQD instruction. 
See the PCMPGTB instruction. 
See the PCMPGTD instruction. 
See the PCMPGTW instruction. 
PCMPGTB

mnemonic                        opcode     description
PCMPGTB mmreg1, mmreg2/mem64    0F 64h     Compare signed packed 8-bit values for magnitude

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPGTB instruction operates on signed 8-bit data values. The instruction 
compares two signed 8-bit values to determine if the value in the destination operand 
is greater than the corresponding signed 8-bit data value in the source operand. 

If the value in the destination operand is greater than the value in the source operand,
all 8 bits of that element in the destination operand are set to 1. If the value in the
destination operand is equal to or less than the value in the source operand, all 8 bits
of that element in the destination operand are set to 0.
Functional Illustration of the PCMPGTB Instruction

Bits     mmreg2/mem64   mmreg1   Result (mmreg1)   Greater?
63-56    DCh            DDh      FFh               True
55-48    25h            24h      00h               False
47-40    41h            42h      FFh               True
39-32    FFh            01h      FFh               True
31-24    80h            80h      00h               False
23-16    7Fh            80h      00h               False
15-8     A6h            A3h      00h               False
7-0      04h            14h      FFh               True



The following list explains the functional illustration of the PCMPGTB instruction:

■ The negative value DDh (-35) is greater than the negative value DCh (-36), so the
result is true (FFh).

■ The positive value 24h (+36) is not greater than the positive value 25h (+37), so the
result is false (00h).

■ The positive value 42h (+66) is greater than the positive value 41h (+65), so the
result is true (FFh).

■ The positive value 01h (+1) is greater than the negative value FFh (-1), so the
result is true (FFh).

■ The negative value 80h (-128) is not greater than the negative value 80h (-128), so
the result is false (00h).

■ The negative value 80h (-128) is not greater than the positive value 7Fh (+127), so
the result is false (00h).

■ The negative value A3h (-93) is not greater than the negative value A6h (-90), so
the result is false (00h).

■ The positive value 14h (+20) is greater than the positive value 04h (+4), so the
result is true (FFh).
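The eight comparisons in the list above can be checked with a short Python model. This is a sketch only: `pcmpgtb` and `to_signed8` are hypothetical helper names, and packing the list's byte values into two 64-bit integers (most significant byte first) is an assumption about how the operands are laid out.

```python
def to_signed8(byte: int) -> int:
    """Interpret an 8-bit pattern as a two's-complement signed value."""
    return byte - 0x100 if byte & 0x80 else byte


def pcmpgtb(dst: int, src: int) -> int:
    """Model PCMPGTB: FFh in each byte where signed dst > signed src."""
    result = 0
    for shift in range(0, 64, 8):
        a = to_signed8((dst >> shift) & 0xFF)
        b = to_signed8((src >> shift) & 0xFF)
        if a > b:
            result |= 0xFF << shift
    return result


# Destination and source bytes from the list above, high byte first.
mmreg1 = 0xDD24_4201_8080_A314
mmreg2 = 0xDC25_41FF_807F_A604
print(f"{pcmpgtb(mmreg1, mmreg2):016X}")  # prints FF00FFFF000000FF
```

The sign conversion is the important step. Compared as unsigned values, FFh would be greater than 01h and 80h would be greater than 7Fh, which is the opposite of what the instruction reports.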

Related Instructions See the PCMPEQB instruction.

See the PCMPEQD instruction.
See the PCMPEQW instruction.
See the PCMPGTD instruction.
See the PCMPGTW instruction.
PCMPGTD

mnemonic                        opcode     description
PCMPGTD mmreg1, mmreg2/mem64    0F 66h     Compare signed packed 32-bit values for magnitude

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPGTD instruction operates on signed 32-bit data values. The instruction 
compares two signed 32-bit values to determine if the value in the destination operand 
is greater than the corresponding signed 32-bit data value in the source operand. 

If the value in the destination operand is greater than the value in the source operand,
all 32 bits of that element in the destination operand are set to 1. If the value in the
destination operand is equal to or less than the value in the source operand, all 32 bits
of that element in the destination operand are set to 0.
Functional Illustration of the PCMPGTD Instruction

Bits     mmreg2/mem64   mmreg1        Result (mmreg1)   Greater?
63-32    0000_BA14h     0000_BA15h    FFFF_FFFFh        True
31-0     FFFF_FFFFh     0000_0001h    FFFF_FFFFh        True



The following list explains the functional illustration of the PCMPGTD instruction:

■ The positive value 0000_BA15h (+47637) is greater than the positive value
0000_BA14h (+47636), so the result is true (FFFF_FFFFh).

■ The positive value 0000_0001h (+1) is greater than the negative value
FFFF_FFFFh (-1), so the result is true (FFFF_FFFFh).



Related Instructions 



See the PCMPEQB instruction.
See the PCMPEQD instruction. 
See the PCMPEQW instruction. 
See the PCMPGTB instruction. 
See the PCMPGTW instruction. 
PCMPGTW

mnemonic                        opcode     description
PCMPGTW mmreg1, mmreg2/mem64    0F 65h     Compare signed packed 16-bit values for magnitude

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PCMPGTW instruction operates on signed 16-bit data values. The instruction 
compares two signed 16-bit values to determine if the value in the destination operand 
is greater than the corresponding signed 16-bit data value in the source operand. 

If the value in the destination operand is greater than the value in the source operand,
all 16 bits of that element in the destination operand are set to 1. If the value in the
destination operand is equal to or less than the value in the source operand, all 16 bits
of that element in the destination operand are set to 0.
Functional Illustration of the PCMPGTW Instruction

Bits     mmreg2/mem64   mmreg1   Result (mmreg1)   Greater?
63-48    0001h          DA14h    0000h             False
47-32    8000h          8000h    0000h             False
31-16    FFFFh          0001h    FFFFh             True
15-0     1234h          1243h    FFFFh             True



The following list explains the functional illustration of the PCMPGTW instruction:

■ The negative value DA14h (-9708) is not greater than the positive value 0001h (+1),
so the result is false (0000h).

■ The negative value 8000h (-32768) is not greater than the negative value 8000h
(-32768), so the result is false (0000h).

■ The positive value 0001h (+1) is greater than the negative value FFFFh (-1), so the
result is true (FFFFh).

■ The positive value 1243h (+4675) is greater than the positive value 1234h (+4660),
so the result is true (FFFFh).



Related Instructions



See the PCMPEQB instruction. 
See the PCMPEQD instruction. 
See the PCMPEQW instruction. 
See the PCMPGTB instruction. 
See the PCMPGTD instruction. 
PMADDWD

mnemonic                        opcode     description
PMADDWD mmreg1, mmreg2/mem64    0F F5h     Multiply signed packed 16-bit values and add the 32-bit
                                           results

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PMADDWD instruction multiplies the four signed 16-bit values in the source operand
(an MMX register or a 64-bit memory location) by the corresponding signed 16-bit
values in the destination operand (an MMX register), producing four 32-bit
intermediate products. It then adds the two products from the upper half of the
register and the two products from the lower half, and stores the two 32-bit sums in
the MMX destination register.

Note: If all four of the 16-bit operands are 8000h, the result wraps around to 8000_0000h
because the maximum negative 16-bit value of 8000h multiplied by itself equals
4000_0000h, and 4000_0000h added to 4000_0000h equals 8000_0000h. The result
of multiplying two negative numbers should be a positive number, but 8000_0000h
is the maximum possible 32-bit negative number rather than a positive number.
This is the only instance of wraparound that can occur as a result of the
PMADDWD instruction.
Functional Illustration of the PMADDWD Instruction

The following list explains the functional illustration of the PMADDWD instruction:

■ The signed 16-bit negative value FFFEh (-2) is multiplied by the signed 16-bit
positive value 0002h to produce a signed 32-bit negative intermediate result of
FFFF_FFFCh (-4).

■ The signed 16-bit positive value 7FFFh is multiplied by the signed 16-bit positive
value 7FFFh to produce a signed 32-bit positive intermediate result of
3FFF_0001h.

■ The two 32-bit intermediate results are added together to produce the final signed
32-bit positive result of 3FFE_FFFDh.

■ The signed 16-bit positive value 7007h is multiplied by the signed 16-bit positive
value 0FF9h to produce a signed 32-bit intermediate result of 06FD_5FCFh.

■ The signed 16-bit negative value FFFFh (-1) is multiplied by the signed 16-bit
negative value FFFFh (-1) to produce a signed 32-bit positive intermediate result
of 0000_0001h.

■ The two 32-bit intermediate results are added together to produce the final signed
32-bit positive result of 06FD_5FD0h.
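The multiply-and-add steps in the list above, along with the single wraparound case, can be reproduced with a small Python model. This is a sketch under stated assumptions: `pmaddwd` and `to_signed16` are hypothetical helper names, and the assignment of the list's values to particular word positions and to the destination or source operand is a guess, since only the values themselves are given.

```python
def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pmaddwd(dst: int, src: int) -> int:
    """Model PMADDWD: multiply word pairs, add adjacent 32-bit products."""
    result = 0
    for half in range(2):                 # lower doubleword, then upper
        total = 0
        for word in range(2):             # the two words in this half
            shift = 32 * half + 16 * word
            a = to_signed16((dst >> shift) & 0xFFFF)
            b = to_signed16((src >> shift) & 0xFFFF)
            total += a * b
        # Keep 32 bits; this is where 8000h * 8000h * 2 wraps around.
        result |= (total & 0xFFFFFFFF) << (32 * half)
    return result


# Values from the list above (word placement is an assumption).
mmreg1 = 0xFFFE_7FFF_7007_FFFF
mmreg2 = 0x0002_7FFF_0FF9_FFFF
print(f"{pmaddwd(mmreg1, mmreg2):016X}")  # prints 3FFEFFFD06FD5FD0

# The wraparound case: every word is 8000h.
all_min = 0x8000_8000_8000_8000
print(f"{pmaddwd(all_min, all_min):016X}")  # prints 8000000080000000
```

Each 32-bit sum is a two-term dot product, which is why PMADDWD is a natural building block for filters and matrix arithmetic on 16-bit data.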



Related Instructions

See the PMULHW instruction.
See the PMULLW instruction.
PMULHW

mnemonic                        opcode     description
PMULHW mmreg1, mmreg2/mem64     0F E5h     Multiply signed packed 16-bit values and store the high 16
                                           bits

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PMULHW instruction multiplies four signed 16-bit values from the source 
operand (an MMX register or a 64-bit memory location) by the four corresponding 
signed 16-bit values in the destination operand (an MMX register) and then stores the 
high-order 16 bits of the result (including the sign bit) in the destination operand. 







Functional Illustration of the PMULHW Instruction

The following list explains the functional illustration of the PMULHW instruction:

■ The signed 16-bit negative value D250h (-2DB0h) is multiplied by the signed 16-bit
negative value 8807h (-77F9h) to produce the signed 32-bit positive result of
1569_4030h. The signed high-order 16 bits of the result are stored in the
destination operand.

■ The signed 16-bit positive value 5321h is multiplied by the signed 16-bit negative
value EC22h (-13DEh) to produce the signed 32-bit negative result of F98C_7662h
(-0673_899Eh). The signed high-order 16 bits of the result are stored in the
destination operand.

■ The signed 16-bit positive value 7007h is multiplied by the signed 16-bit positive
value 0FF9h to produce the signed 32-bit positive result of 06FD_5FCFh. The
signed high-order 16 bits of the result are stored in the destination operand.

■ The signed 16-bit negative value FFFFh (-1) is multiplied by the signed 16-bit
negative value FFFFh (-1) to produce the signed 32-bit positive result of
0000_0001h. The signed high-order 16 bits of the result are stored in the
destination operand.
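The four products in the list above can be checked with a Python model of the high-half multiply. This is a sketch, not the processor's implementation: `pmulhw` and `to_signed16` are hypothetical helper names, and the placement of the list's values in particular word positions of each operand is an assumption.

```python
def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pmulhw(dst: int, src: int) -> int:
    """Model PMULHW: keep bits 31-16 of each signed 16x16 product."""
    result = 0
    for shift in range(0, 64, 16):
        a = to_signed16((dst >> shift) & 0xFFFF)
        b = to_signed16((src >> shift) & 0xFFFF)
        product = a * b
        # Python's >> is arithmetic, so the sign carries into the high half.
        high = (product >> 16) & 0xFFFF
        result |= high << shift
    return result


# Values from the list above (word placement is an assumption).
mmreg1 = 0xD250_5321_7007_FFFF
mmreg2 = 0x8807_EC22_0FF9_FFFF
print(f"{pmulhw(mmreg1, mmreg2):016X}")  # prints 1569F98C06FD0000
```

The stored words 1569h, F98Ch, 06FDh, and 0000h are the upper halves of the 32-bit products named in the list, so the second word stays negative just as the full product F98C_7662h is negative.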

Related Instructions See the PMADDWD instruction.

See the PMULLW instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLWD instruction. 
PMULLW

mnemonic                        opcode     description
PMULLW mmreg1, mmreg2/mem64     0F D5h     Multiply signed packed 16-bit values and store the low 16
                                           bits

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The PMULLW instruction multiplies four signed 16-bit values from the source 
operand (an MMX register or a 64-bit memory location) by the four corresponding 
signed 16-bit values in the destination operand (an MMX register) and then stores the 
low-order 16 bits of the result (unsigned) in the destination operand. 
Functional Illustration of the PMULLW Instruction

The following list explains the functional illustration of the PMULLW instruction:

■ The signed 16-bit negative value D250h (-2DB0h) is multiplied by the signed 16-bit
negative value 8807h (-77F9h) to produce the signed 32-bit positive result of
1569_4030h. The unsigned low-order 16 bits of the result are stored in the
destination operand.

■ The signed 16-bit positive value 5321h is multiplied by the signed 16-bit negative
value EC22h (-13DEh) to produce the signed 32-bit negative result of F98C_7662h
(-0673_899Eh). The unsigned low-order 16 bits of the result are stored in the
destination operand.

■ The signed 16-bit positive value 7007h is multiplied by the signed 16-bit positive
value 0FF9h to produce the signed 32-bit positive result of 06FD_5FCFh. The
unsigned low-order 16 bits of the result are stored in the destination operand.

■ The signed 16-bit negative value FFFFh (-1) is multiplied by the signed 16-bit
negative value FFFFh (-1) to produce the signed 32-bit positive result of
0000_0001h. The unsigned low-order 16 bits of the result are stored in the
destination operand.
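The low-half multiply in the list above can be modeled the same way. Again this is only a sketch: `pmullw` and `to_signed16` are hypothetical helper names, and the placement of the list's values in particular word positions of each operand is an assumption.

```python
def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pmullw(dst: int, src: int) -> int:
    """Model PMULLW: keep bits 15-0 of each signed 16x16 product."""
    result = 0
    for shift in range(0, 64, 16):
        a = to_signed16((dst >> shift) & 0xFFFF)
        b = to_signed16((src >> shift) & 0xFFFF)
        low = (a * b) & 0xFFFF  # low-order 16 bits of the 32-bit product
        result |= low << shift
    return result


# Values from the list above (word placement is an assumption).
mmreg1 = 0xD250_5321_7007_FFFF
mmreg2 = 0x8807_EC22_0FF9_FFFF
print(f"{pmullw(mmreg1, mmreg2):016X}")  # prints 403076625FCF0001
```

The stored words 4030h, 7662h, 5FCFh, and 0001h are the lower halves of the products in the list. Pairing them with the high halves that PMULHW produces for the same operands gives back the full 32-bit products.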



Related Instructions 



See the PMADDWD instruction. 
See the PMULHW instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLWD instruction. 
POR

mnemonic                      opcode     description
POR mmreg1, mmreg2/mem64      0F EBh     OR 64-bit values

Privilege:            none
Registers Affected:   MMX
Flags Affected:       none
Exceptions Generated:

Exception                   Real   Virtual   Protected   Description
                                   8086
Invalid opcode (6)           X       X          X        The emulate MMX instruction bit (EM) of the control
                                                         register (CR0) is set to 1.
Device not available (7)     X       X          X        Save the floating-point or MMX state if the task
                                                         switch bit (TS) of the control register (CR0) is
                                                         set to 1.
Stack exception (12)                            X        During instruction execution, the stack segment
                                                         limit was exceeded.
General protection (13)                         X        During instruction execution, the effective address
                                                         of one of the segment registers used for the
                                                         operand points to an illegal memory location.
Segment overrun (13)         X       X                   One of the instruction data operands falls outside
                                                         the address range 00000h to 0FFFFh.
Page fault (14)                      X          X        A page fault resulted from the execution of the
                                                         instruction.
Floating-point exception     X       X          X        An exception is pending due to the floating-point
pending (16)                                             execution unit.
Alignment check (17)                 X          X        An unaligned memory reference resulted from the
                                                         instruction execution, and the alignment mask bit
                                                         (AM) of the control register (CR0) is set to 1.
                                                         (In Protected Mode, CPL = 3.)



The POR instruction logically ORs the 64 bits of the source operand (an MMX register
or a 64-bit memory location) with the 64 bits of the destination operand (an MMX
register) and stores the result in the destination register.

A logical OR produces a 1 bit if either or both input bits is a 1. If both input bits are 0, 
a logical OR produces a 0 bit. 
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hincUoiial lltastratloii of the POR Instnictioii 
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In the functional illustration of the POR instruction, the 64-bit source value is 
logically OR*d to the 64-bit destination value, and the result is stored in the 
destination register. 
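
As a rough sketch of the operation described above, the following Python function models POR on two 64-bit integers. The function name `por` and the operand values are illustrative assumptions; they are not taken from the original illustration.

```python
MASK64 = (1 << 64) - 1  # an MMX register holds 64 bits

def por(dest: int, src: int) -> int:
    """Model POR: bitwise OR of the 64-bit source and destination values."""
    return (dest | src) & MASK64

# Hypothetical operand values chosen only for illustration.
result = por(0x00FF00FF00FF00FF, 0x0F0F0F0F0F0F0F0F)
print(f"{result:016X}")  # prints 0FFF0FFF0FFF0FFF
```

Because every bit position is independent, the same function applies whether the register holds bytes, words, or doublewords.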

Related Instructions

See the PAND instruction.
See the PANDN instruction.
See the PXOR instruction.
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PSLLD 



mnemonic                       opcode      description
PSLLD mmreg1, mmreg2/mem64     0F F2h      Shift left logical packed 32-bit values in mmreg1 the
                                           number of positions in mmreg2/mem64 with zero fill
                                           from the right
PSLLD mmreg1, imm8             0F 72h /6   Shift left logical packed 32-bit values in mmreg1 the
                                           number of positions in imm8 with zero fill from the
                                           right

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSLLD instruction shifts the two 32-bit operands in the destination operand (an
MMX register) to the left by the number of bit positions indicated by mmreg2/mem64
or by imm8, the 8-bit immediate operand. The shifted values are zero filled from the
right. The two 32-bit results are stored in the MMX register specified as the
destination operand.

Functional Illustration of the PSLLD Instruction

The following list explains the functional illustration of the PSLLD instruction:

■ The value 0000_0000_0000_0008h in mmreg2/mem64 indicates a shift of 8 bit
positions to the left.

■ The 32-bit value 000F_A3BEh in mmreg1 is shifted 8 bit positions to the left and
stored in mmreg1 as 0FA3_BE00h.

■ The 32-bit value 0123_4567h in mmreg1 is shifted 8 bit positions to the left and
stored in mmreg1 as 2345_6700h.
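
The list above can be checked with a small Python model. This is only a sketch: the function name `pslld` is hypothetical, and the handling of counts above 31 is an assumption drawn from general x86 documentation rather than from this text.

```python
def pslld(dest: int, count: int) -> int:
    """Model PSLLD: shift both 32-bit lanes of a 64-bit value left, zero filling."""
    if count > 31:
        # Assumption (not stated in this text): oversized counts clear every lane.
        return 0
    result = 0
    for lane in range(2):
        value = (dest >> (32 * lane)) & 0xFFFFFFFF
        # Bits shifted out of a lane are discarded, never carried into its neighbor.
        result |= ((value << count) & 0xFFFFFFFF) << (32 * lane)
    return result

# Operands from the functional illustration: lanes 000F_A3BEh and 0123_4567h, count 8.
print(f"{pslld(0x000FA3BE_01234567, 8):016X}")  # prints 0FA3BE0023456700
```

Note that the low lane loses its top byte (01h) instead of passing it to the high lane, which is what distinguishes the packed shift from a single 64-bit shift.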



Related Instructions 



See the PSLLQ instruction.
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
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PSLLQ 



mnemonic                       opcode      description
PSLLQ mmreg1, mmreg2/mem64     0F F3h      Shift left logical 64-bit values in mmreg1 the number
                                           of positions in mmreg2/mem64 with zero fill from the
                                           right
PSLLQ mmreg1, imm8             0F 73h /6   Shift left logical 64-bit values in mmreg1 the number
                                           of positions in imm8 with zero fill from the right

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSLLQ instruction shifts the 64-bit operand in the destination operand (an MMX
register) to the left by the number of bit positions indicated by mmreg2/mem64 or by
imm8, the 8-bit immediate operand. The shifted value is zero filled from the right. The
64-bit result is stored in the MMX register specified as the destination operand.

Functional Illustration of the PSLLQ Instruction

The following list explains the functional illustration of the PSLLQ instruction:

■ The value 0000_0000_0000_0008h in mmreg2/mem64 indicates a shift of 8 bit
positions to the left.

■ The 64-bit value 000F_A3BE_0123_4567h in mmreg1 is shifted 8 bit positions to the
left and stored in mmreg1 as 0FA3_BE01_2345_6700h.
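
A minimal Python sketch of the same operation follows. The name `psllq` is hypothetical, and returning zero for counts above 63 is an assumption based on general x86 documentation, not on this text.

```python
MASK64 = (1 << 64) - 1  # keep results within one 64-bit MMX register

def psllq(dest: int, count: int) -> int:
    """Model PSLLQ: shift one 64-bit value left, zero filling from the right."""
    if count > 63:
        # Assumption (not stated in this text): oversized counts yield zero.
        return 0
    return (dest << count) & MASK64

# Operand and count from the functional illustration.
print(f"{psllq(0x000FA3BE_01234567, 8):016X}")  # prints 0FA3BE0123456700
```

Here the byte 01h crosses from the low half into the high half, because the register is treated as a single 64-bit quantity.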



Related Instructions



See the PSLLD instruction. 
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
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PSLLW 



mnemonic                       opcode      description
PSLLW mmreg1, mmreg2/mem64     0F F1h      Shift left logical packed 16-bit values in mmreg1 the
                                           number of positions in mmreg2/mem64 with zero fill
                                           from the right
PSLLW mmreg1, imm8             0F 71h /6   Shift left logical packed 16-bit values in mmreg1 the
                                           number of positions in imm8 with zero fill from the
                                           right

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSLLW instruction shifts the four 16-bit operands in the destination operand (an
MMX register) to the left by the number of bit positions indicated by mmreg2/mem64
or by imm8, the 8-bit immediate operand. The shifted values are zero filled from the
right. The four 16-bit results are stored in the MMX register specified as the
destination operand.

Functional Illustration of the PSLLW Instruction

The following list explains the functional illustration of the PSLLW instruction:

■ The value 0000_0000_0000_0008h in mmreg2/mem64 indicates a shift of 8 bit
positions to the left.

■ The 16-bit value 8807h in mmreg1 is shifted 8 bit positions to the left and stored in
mmreg1 as 0700h.

■ The 16-bit value EC22h in mmreg1 is shifted 8 bit positions to the left and stored in
mmreg1 as 2200h.

■ The 16-bit value 0FF9h in mmreg1 is shifted 8 bit positions to the left and stored in
mmreg1 as F900h.

■ The 16-bit value FFFFh in mmreg1 is shifted 8 bit positions to the left and stored in
mmreg1 as FF00h.
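
The four word results listed above can be reproduced with the following Python sketch. The name `psllw` is hypothetical, the ordering of the four words within the register is assumed to follow the order of the list, and the treatment of counts above 15 is an assumption from general x86 documentation.

```python
def psllw(dest: int, count: int) -> int:
    """Model PSLLW: shift four 16-bit lanes left independently, zero filling."""
    if count > 15:
        # Assumption (not stated in this text): oversized counts clear every lane.
        return 0
    # Split the register into words, most significant first.
    lanes = [(dest >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    result = 0
    for lane in lanes:
        # Shift within the lane and drop any bits that overflow 16 bits.
        result = (result << 16) | ((lane << count) & 0xFFFF)
    return result

# Words from the functional illustration, shifted by 8.
print(f"{psllw(0x8807_EC22_0FF9_FFFF, 8):016X}")  # prints 07002200F900FF00
```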

Related Instructions

See the PSLLD instruction.
See the PSLLQ instruction.
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
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PSRAD 

mnemonic                       opcode      description
PSRAD mmreg1, mmreg2/mem64     0F E2h      Shift right arithmetic packed signed 32-bit values in
                                           mmreg1 the number of positions in mmreg2/mem64 with
                                           sign fill from the left
PSRAD mmreg1, imm8             0F 72h /4   Shift right arithmetic packed signed 32-bit values in
                                           mmreg1 the number of positions in imm8 with sign fill
                                           from the left

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSRAD instruction shifts the two signed 32-bit operands in the destination
operand (an MMX register) to the right by the number of bit positions indicated by
mmreg2/mem64 or by imm8, the 8-bit immediate operand. The shifted values are sign
filled from the left. The two signed 32-bit results are stored in the MMX register
specified as the destination operand.

Functional Illustration of the PSRAD Instruction

The following list explains the functional illustration of the PSRAD instruction:

■ The value 0000_0000_0000_0010h in mmreg2/mem64 indicates a shift of 16 bit
positions to the right.

■ The 32-bit negative value FFF0_0000h in mmreg1 is shifted 16 bit positions to the
right with sign fill from the left and stored in mmreg1 as FFFF_FFF0h.

■ The 32-bit positive value 0123_0000h in mmreg1 is shifted 16 bit positions to the
right with sign fill from the left and stored in mmreg1 as 0000_0123h.
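
Sign fill can be sketched in Python by reinterpreting each lane as a signed integer before shifting. The name `psrad` is hypothetical, and clamping the count to 31 is an assumption based on general x86 documentation rather than on this text.

```python
def psrad(dest: int, count: int) -> int:
    """Model PSRAD: arithmetic right shift of two signed 32-bit lanes."""
    # Assumption (not stated in this text): counts above 31 behave like 31,
    # leaving each lane filled with copies of its sign bit.
    count = min(count, 31)
    result = 0
    for lane in range(2):
        value = (dest >> (32 * lane)) & 0xFFFFFFFF
        if value & 0x80000000:
            value -= 1 << 32  # reinterpret the lane as a negative number
        # Python's >> on a negative int replicates the sign, like an arithmetic shift.
        result |= ((value >> count) & 0xFFFFFFFF) << (32 * lane)
    return result

# Lanes from the functional illustration: FFF0_0000h and 0123_0000h, count 16.
print(f"{psrad(0xFFF00000_01230000, 16):016X}")  # prints FFFFFFF000000123
```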



Related Instructions



See the PSLLD instruction. 
See the PSLLQ instruction. 
See the PSLLW instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLWD instruction. 
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PSRAW 



mnemonic                       opcode      description
PSRAW mmreg1, mmreg2/mem64     0F E1h      Shift right arithmetic packed signed 16-bit values in
                                           mmreg1 the number of positions in mmreg2/mem64 with
                                           sign fill from the left
PSRAW mmreg1, imm8             0F 71h /4   Shift right arithmetic packed signed 16-bit values in
                                           mmreg1 the number of positions in imm8 with sign fill
                                           from the left

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSRAW instruction shifts the four signed 16-bit operands in the destination
operand (an MMX register) to the right by the number of bit positions indicated by
mmreg2/mem64 or by imm8, the 8-bit immediate operand. The shifted values are sign
filled from the left. The four signed 16-bit results are stored in the MMX register
specified as the destination operand.






Functional Illustration of the PSRAW Instruction

mmreg2/mem64 (shift count):  0000_0000_0000_0008h
mmreg1 (before):             8800h  EC00h  0F00h  7F00h
mmreg1 (after):              FF88h  FFECh  000Fh  007Fh

The following list explains the functional illustration of the PSRAW instruction:

■ The value 0000_0000_0000_0008h in mmreg2/mem64 indicates a shift of 8 bit
positions to the right.

■ The 16-bit negative value 8800h in mmreg1 is shifted 8 bit positions to the right
with sign fill from the left and stored in mmreg1 as FF88h.

■ The 16-bit negative value EC00h in mmreg1 is shifted 8 bit positions to the right
with sign fill from the left and stored in mmreg1 as FFECh.

■ The 16-bit positive value 0F00h in mmreg1 is shifted 8 bit positions to the right
with sign fill from the left and stored in mmreg1 as 000Fh.

■ The 16-bit positive value 7F00h in mmreg1 is shifted 8 bit positions to the right
with sign fill from the left and stored in mmreg1 as 007Fh.
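
The sign fill in the list above can also be modeled with an explicit fill mask, which makes the "fill from the left" step visible. This is a sketch only; the name `psraw` is hypothetical, and clamping the count to 15 is an assumption from general x86 documentation.

```python
def psraw(dest: int, count: int) -> int:
    """Model PSRAW: arithmetic right shift of four signed 16-bit lanes."""
    # Assumption (not stated in this text): counts above 15 behave like 15.
    count = min(count, 15)
    result = 0
    for lane in range(4):
        value = (dest >> (16 * lane)) & 0xFFFF
        shifted = value >> count  # logical shift first: vacated bits are 0
        if value & 0x8000:
            # Negative lane: set the vacated high bits to 1 (sign fill).
            shifted |= (0xFFFF << (16 - count)) & 0xFFFF
        result |= shifted << (16 * lane)
    return result

# Words from the functional illustration, shifted by 8.
print(f"{psraw(0x8800_EC00_0F00_7F00, 8):016X}")  # prints FF88FFEC000F007F
```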



Related Instructions



See the PSLLD instruction. 
See the PSLLQ instruction. 
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
See the PUNPCKHBW instruction. 
See the PUNPCKLBW instruction. 
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PSRLD 

mnemonic                       opcode      description
PSRLD mmreg1, mmreg2/mem64     0F D2h      Shift right logical packed 32-bit values in mmreg1 the
                                           number of positions in mmreg2/mem64 with zero fill
                                           from the left
PSRLD mmreg1, imm8             0F 72h /2   Shift right logical packed 32-bit values in mmreg1 the
                                           number of positions in imm8 with zero fill from the
                                           left

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSRLD instruction shifts the two 32-bit operands in the destination operand (an
MMX register) to the right by the number of bit positions indicated by
mmreg2/mem64 or by imm8, the 8-bit immediate operand. The shifted values are zero
filled from the left. The two 32-bit results are stored in the MMX register specified as
the destination operand.

Functional Illustration of the PSRLD Instruction

mmreg2/mem64 (shift count):  0000_0000_0000_0010h
mmreg1 (before):             FFF0_0000h  0123_4567h
mmreg1 (after):              0000_FFF0h  0000_0123h

The following list explains the functional illustration of the PSRLD instruction:

■ The value 0000_0000_0000_0010h in mmreg2/mem64 indicates a shift of 16 bit
positions to the right.

■ The 32-bit value FFF0_0000h in mmreg1 is shifted 16 bit positions to the right and
stored in mmreg1 as 0000_FFF0h.

■ The 32-bit value 0123_4567h in mmreg1 is shifted 16 bit positions to the right and
stored in mmreg1 as 0000_0123h.
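
A short Python sketch of the logical right shift follows. The name `psrld` is hypothetical, and returning zero for counts above 31 is an assumption from general x86 documentation, not from this text.

```python
def psrld(dest: int, count: int) -> int:
    """Model PSRLD: logical right shift of two 32-bit lanes, zero filling."""
    if count > 31:
        # Assumption (not stated in this text): oversized counts clear every lane.
        return 0
    high = (dest >> 32) & 0xFFFFFFFF
    low = dest & 0xFFFFFFFF
    # Each lane is shifted on its own; bits leaving the high lane are discarded
    # and do not enter the low lane.
    return ((high >> count) << 32) | (low >> count)

# Lanes from the functional illustration: FFF0_0000h and 0123_4567h, count 16.
print(f"{psrld(0xFFF00000_01234567, 16):016X}")  # prints 0000FFF000000123
```

The value FFF0_0000h becomes 0000_FFF0h here, whereas the arithmetic shift PSRAD would produce FFFF_FFF0h for the same input.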

Related Instructions

See the PSLLD instruction.
See the PSLLQ instruction.
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLQ instruction. 
See the PSRLW instruction. 
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PSRLQ 



mnemonic                       opcode      description
PSRLQ mmreg1, mmreg2/mem64     0F D3h      Shift right logical 64-bit values in mmreg1 the number
                                           of positions in mmreg2/mem64 with zero fill from the
                                           left
PSRLQ mmreg1, imm8             0F 73h /2   Shift right logical 64-bit values in mmreg1 the number
                                           of positions in imm8 with zero fill from the left

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSRLQ instruction shifts the 64-bit operand in the destination operand (an MMX
register) to the right by the number of bit positions indicated by mmreg2/mem64 or by
imm8, the 8-bit immediate operand. The shifted value is zero filled from the left. The
result is stored in the MMX register specified as the destination operand.

Functional Illustration of the PSRLQ Instruction

The following list explains the functional illustration of the PSRLQ instruction:

■ The value 0000_0000_0000_0010h in mmreg2/mem64 indicates a shift of 16 bit
positions to the right.

■ The 64-bit value 000F_A3BE_0123_4567h in mmreg1 is shifted 16 bit positions to
the right and stored in mmreg1 as 0000_000F_A3BE_0123h.
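
The 64-bit case reduces to a single shift, as this Python sketch shows. The name `psrlq` is hypothetical, and returning zero for counts above 63 is an assumption from general x86 documentation.

```python
MASK64 = (1 << 64) - 1  # treat the operand as an unsigned 64-bit quantity

def psrlq(dest: int, count: int) -> int:
    """Model PSRLQ: logical right shift of one 64-bit value, zero filling."""
    if count > 63:
        # Assumption (not stated in this text): oversized counts yield zero.
        return 0
    return (dest & MASK64) >> count

# Operand and count from the functional illustration.
print(f"{psrlq(0x000FA3BE_01234567, 16):016X}")  # prints 0000000FA3BE0123
```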



Related Instructions 



See the PSLLD instruction. 
See the PSLLQ instruction. 
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLW instruction. 
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PSRLW 



mnemonic                       opcode      description
PSRLW mmreg1, mmreg2/mem64     0F D1h      Shift right logical packed 16-bit values in mmreg1 the
                                           number of positions in mmreg2/mem64 with zero fill
                                           from the left
PSRLW mmreg1, imm8             0F 71h /2   Shift right logical packed 16-bit values in mmreg1 the
                                           number of positions in imm8 with zero fill from the
                                           left

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSRLW instruction shifts the four 16-bit operands in the destination operand (an
MMX register) to the right by the number of bit positions indicated by
mmreg2/mem64 or by imm8, the 8-bit immediate operand. The shifted values are zero
filled from the left. The four 16-bit results are stored in the MMX register specified as
the destination operand.

The following list explains the functional illustration of the PSRLW instruction:

■ The value 0000_0000_0000_0008h in mmreg2/mem64 indicates a shift of 8 bit
positions to the right.

■ The 16-bit value 8800h in mmreg1 is shifted 8 bit positions to the right and stored
in mmreg1 as 0088h.

■ The 16-bit value EC22h in mmreg1 is shifted 8 bit positions to the right and stored
in mmreg1 as 00ECh.

■ The 16-bit value 0FF9h in mmreg1 is shifted 8 bit positions to the right and stored
in mmreg1 as 000Fh.

■ The 16-bit value FF00h in mmreg1 is shifted 8 bit positions to the right and stored
in mmreg1 as 00FFh.
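
The word results above can be reproduced with the following Python sketch. The name `psrlw` is hypothetical, the ordering of the words within the register is assumed to follow the order of the list, and the treatment of counts above 15 is an assumption from general x86 documentation.

```python
def psrlw(dest: int, count: int) -> int:
    """Model PSRLW: logical right shift of four 16-bit lanes, zero filling."""
    if count > 15:
        # Assumption (not stated in this text): oversized counts clear every lane.
        return 0
    result = 0
    for lane in range(4):
        value = (dest >> (16 * lane)) & 0xFFFF
        # Isolating the lane first guarantees that zeros, not bits from the
        # neighboring lane, enter from the left.
        result |= (value >> count) << (16 * lane)
    return result

# Words from the functional illustration, shifted by 8.
print(f"{psrlw(0x8800_EC22_0FF9_FF00, 8):016X}")  # prints 008800EC000F00FF
```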



Related Instructions 



See the PSLLD instruction. 
See the PSLLQ instruction. 
See the PSLLW instruction. 
See the PSRAD instruction. 
See the PSRAW instruction. 
See the PSRLD instruction. 
See the PSRLQ instruction. 
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PSUBB 



mnemonic                       opcode      description
PSUBB mmreg1, mmreg2/mem64     0F F8h      Subtract unsigned packed 8-bit values with wraparound

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSUBB instruction subtracts eight unsigned 8-bit values in the source operand (an
MMX register or a 64-bit memory location) from the eight corresponding unsigned
8-bit values in the destination operand (an MMX register). If the source operand is
larger than the destination operand, the result wraps around.

Functional Illustration of the PSUBB Instruction

mmreg2/mem64 (source):  00h  88h  ECh  00h  14h  44h  F7h  A8h
mmreg1 (result):        00h  4Ah  67h  42h  63h  2Ch  10h  F2h

The following list explains the functional illustration of the PSUBB instruction:

■ The unsigned 8-bit value ECh is subtracted from the unsigned 8-bit value 53h and
wraps around to 67h.

■ The unsigned 8-bit value F7h is subtracted from the unsigned 8-bit value 07h and
wraps around to 10h.

■ The unsigned 8-bit value A8h is subtracted from the unsigned 8-bit value 9Ah and
wraps around to F2h.

■ All the remaining operations are simple subtraction with no wraparound.
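
Wraparound is the same as subtraction modulo 256 in each byte, which the following Python sketch models. The name `psubb` is hypothetical, and the test operands place only the three wrapping byte pairs from the list in the low three bytes; the other bytes are left at zero for illustration.

```python
def psubb(dest: int, src: int) -> int:
    """Model PSUBB: subtract eight packed bytes with wraparound."""
    result = 0
    for lane in range(8):
        a = (dest >> (8 * lane)) & 0xFF
        b = (src >> (8 * lane)) & 0xFF
        # Masking with 0xFF wraps a negative difference into the range 00h..FFh
        # and keeps any borrow from leaking into the next byte.
        result |= ((a - b) & 0xFF) << (8 * lane)
    return result

# Byte pairs from the list: 53h - ECh, 07h - F7h, 9Ah - A8h.
print(f"{psubb(0x53_07_9A, 0xEC_F7_A8):06X}")  # prints 6710F2
```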

Related Instructions

See the PSUBD instruction.
See the PSUBW instruction.
See the PSUBSB instruction.
See the PSUBSW instruction.
See the PSUBUSB instruction.
See the PSUBUSW instruction.
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PSUBD 



mnemonic                       opcode      description
PSUBD mmreg1, mmreg2/mem64     0F FAh      Subtract unsigned packed 32-bit values with wraparound

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSUBD instruction subtracts two unsigned 32-bit values in the source operand (an
MMX register or a 64-bit memory location) from the two corresponding unsigned
32-bit values in the destination operand (an MMX register). If the source operand is
larger than the destination operand, the result wraps around.

Functional Illustration of the PSUBD Instruction

mmreg1 (before, low doubleword):  0123_4567h
mmreg2/mem64 (source):            000F_A3BEh  8000_0000h
mmreg1 (result):                  FFE0_B885h  8123_4567h



The following list explains the functional illustration of the PSUBD instruction: 

■ The unsigned 32-bit value 8000_0000h is subtracted from the unsigned 32-bit value 
0123_4567h and wraps around to 8123_4567h. 

■ The remaining operation is a simple subtraction with no wraparound. 
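
The wraparound case in the list corresponds to subtraction modulo 2^32 in each doubleword. The Python sketch below models that; the name `psubd` is hypothetical, and the example uses only the low doubleword pair named in the list, with the high doublewords left at zero.

```python
def psubd(dest: int, src: int) -> int:
    """Model PSUBD: subtract two packed 32-bit values with wraparound."""
    result = 0
    for lane in range(2):
        a = (dest >> (32 * lane)) & 0xFFFFFFFF
        b = (src >> (32 * lane)) & 0xFFFFFFFF
        # Masking wraps a negative difference back into 32 bits.
        result |= ((a - b) & 0xFFFFFFFF) << (32 * lane)
    return result

# Pair from the list: 0123_4567h - 8000_0000h.
print(f"{psubd(0x01234567, 0x80000000):08X}")  # prints 81234567
```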

Related Instructions

See the PSUBB instruction.
See the PSUBW instruction.
See the PSUBSB instruction. 
See the PSUBSW instruction. 
See the PSUBUSB instruction. 
See the PSUBUSW instruction. 
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PSUBSB 



mnemonic                       opcode      description
PSUBSB mmreg1, mmreg2/mem64    0F E8h      Subtract signed packed 8-bit values and saturate

Privilege: none
Registers Affected: MMX
Flags Affected: none
Exceptions Generated:

Exception                  Real  Virtual  Protected  Description
                                 8086
Invalid opcode (6)          X       X         X      The emulate MMX instruction bit (EM) of the
                                                     control register (CR0) is set to 1.
Device not available (7)    X       X         X      Save the floating-point or MMX state if the
                                                     task switch bit (TS) of the control register
                                                     (CR0) is set to 1.
Stack exception (12)                          X      During instruction execution, the stack
                                                     segment limit was exceeded.
General protection (13)                       X      During instruction execution, the effective
                                                     address of one of the segment registers used
                                                     for the operand points to an illegal memory
                                                     location.
Segment overrun (13)        X       X                One of the instruction data operands falls
                                                     outside the address range 00000h to 0FFFFh.
Page fault (14)                     X         X      A page fault resulted from the execution of
                                                     the instruction.
Floating-point exception    X       X         X      An exception is pending due to the
pending (16)                                         floating-point execution unit.
Alignment check (17)                X         X      An unaligned memory reference resulted from
                                                     the instruction execution, and the alignment
                                                     mask bit (AM) of the control register (CR0)
                                                     is set to 1. (In Protected Mode, CPL = 3.)



The PSUBSB instruction subtracts eight signed 8-bit values in the source operand (an
MMX register or a 64-bit memory location) from the eight corresponding signed 8-bit
values in the destination operand (an MMX register). If a result is less than -128 (80h),
it saturates to -128 (80h). If a result is greater than 127 (7Fh), it saturates to 127 (7Fh).
The eight signed 8-bit results are stored in the MMX register specified as the
destination operand.

Functional Illustration of the PSUBSB Instruction

mmreg1 (result):  80h*  4Ah  67h  7Fh*  63h  2Ch  10h  F2h
* Indicates a saturated value

The following list explains the functional illustration of the PSUBSB instruction:

■ The signed 8-bit positive value 0Fh is subtracted from the signed 8-bit negative
value 82h, and the result saturates to 80h because it is less than 80h, the smallest
possible signed 8-bit value.

■ The signed 8-bit negative value C1h is subtracted from the signed 8-bit positive
value 42h, and the result saturates to 7Fh because it is greater than 7Fh, the
largest possible signed 8-bit value.

■ All the remaining operations are simple signed subtraction with no saturation.
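
Saturation amounts to clamping each signed difference to the range -128 to 127, as the following Python sketch shows. The names `psubsb` and `to_signed8` are hypothetical, and the example uses only the two saturating byte pairs from the list, placed in the low two bytes for illustration.

```python
def to_signed8(byte: int) -> int:
    """Interpret an 8-bit pattern as a two's-complement signed value."""
    return byte - 256 if byte & 0x80 else byte

def psubsb(dest: int, src: int) -> int:
    """Model PSUBSB: subtract eight packed signed bytes with saturation."""
    result = 0
    for lane in range(8):
        a = to_signed8((dest >> (8 * lane)) & 0xFF)
        b = to_signed8((src >> (8 * lane)) & 0xFF)
        # Clamp instead of wrapping: results stay within -128..127.
        diff = max(-128, min(127, a - b))
        result |= (diff & 0xFF) << (8 * lane)
    return result

# Pairs from the list: 82h - 0Fh (saturates low), 42h - C1h (saturates high).
print(f"{psubsb(0x82_42, 0x0F_C1):04X}")  # prints 807F
```

With wraparound subtraction (PSUBB) the same inputs would give 73h and 81h, so saturation changes both the magnitude and the sign of the stored result.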

Related Instructions

See the PSUBB instruction.
See the PSUBD instruction.
See the PSUBW instruction.
See the PSUBSW instruction. 
See the PSUBUSB instruction. 
See the PSUBUSW instruction. 
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PSUBSW 

mnemonic opcode desaiption 

PSUBSW mmregl , mmrefft/tnms^ OF E9h Subtract signed packed 1 6-bit values and saturate 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PSUBSW instruction subtracts four signed 16-bit values in the source operand (an 
MMX register or a 64-bit memory location) from the four corresponding signed 16-bit 
values in the destination operand (an MMX register). If a result is less than -32768 
(8000h), it saturates to -32768 (8000h). If a result is greater than 32767 (7FFFh), it 
saturates to 32767 (7FFFh). The four signed 16-bit results are stored in the MMX 
register specified as the destination operand. 

Functional Illustration of the PSUBSW Instruction 

[Figure: mmreg2/mem64 is subtracted from mmreg1 and the result is stored in mmreg1. ■ indicates a saturated value.] 



The following list explains the functional illustration of the PSUBSW instruction: 

■ The signed 16-bit negative value D320h is subtracted from the signed 16-bit 
positive value 5321h, and the result saturates to 7FFFh because it is greater than 
7FFFh, the largest possible signed 16-bit value. 

■ The signed 16-bit positive value 0FF9h is subtracted from the signed 16-bit 
negative value 8007h, and the result saturates to 8000h because it is less than 
8000h, the smallest possible signed 16-bit value. 

■ The remaining operations are simple signed subtraction with no saturation. 
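
As a cross-check of the arithmetic in the list above, the following sketch models the instruction on four 16-bit lanes. The name psubsw and the integer representation of the register are illustrative assumptions, not part of the processor documentation.

```python
def psubsw(dst: int, src: int) -> int:
    """Model of PSUBSW: per-word signed (dst - src) with saturation."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (dst >> shift) & 0xFFFF
        b = (src >> shift) & 0xFFFF
        # Reinterpret each word as a signed value in [-32768, 32767].
        a = a - 0x10000 if a & 0x8000 else a
        b = b - 0x10000 if b & 0x8000 else b
        diff = max(-32768, min(32767, a - b))  # clamp to signed 16-bit range
        result |= (diff & 0xFFFF) << shift
    return result

# Two word lanes from the list above: 5321h - D320h and 8007h - 0FF9h.
print(hex(psubsw(0x5321_8007, 0xD320_0FF9)))  # 0x7fff8000
```

Here 21281 - (-11488) = 32769 clamps to 7FFFh, and -32761 - 4089 = -36850 clamps to 8000h.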



Related Instructions 



See the PSUBB instruction. 
See the PSUBD instruction. 
See the PSUBW instruction. 
See the PSUBSB instruction. 
See the PSUBUSB instruction. 
See the PSUBUSW instruction. 

PSUBUSB 

mnemonic                        opcode    description 
PSUBUSB mmreg1, mmreg2/mem64    0F D8h    Subtract unsigned packed 8-bit values and saturate 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 
Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PSUBUSB instruction subtracts eight unsigned 8-bit values in the source operand 
(an MMX register or a 64-bit memory location) from the eight corresponding unsigned 
8-bit values in the destination operand (an MMX register). If any 8-bit source value is 
greater than its corresponding 8-bit destination value, the result saturates to 00h. The 
eight unsigned 8-bit results are stored in the MMX register specified as the 
destination operand. 

Functional Illustration of the PSUBUSB Instruction 

                 63                                    0
mmreg2/mem64     0Fh  88h  ECh  C1h  14h  44h  F7h  98h
mmreg1 (result)  73h  4Ah  00h* 00h* 63h  2Ch  00h* 02h

* Indicates a saturated value 




The following list explains the functional illustration of the PSUBUSB instruction: 

■ The unsigned 8-bit value ECh is subtracted from the unsigned 8-bit value 53h, and 
the result saturates to 00h because the source operand is greater than the 
destination operand. 

■ The unsigned 8-bit value C1h is subtracted from the unsigned 8-bit value 42h, and 
the result saturates to 00h because the source operand is greater than the 
destination operand. 

■ The unsigned 8-bit value F7h is subtracted from the unsigned 8-bit value 07h, and 
the result saturates to 00h because the source operand is greater than the 
destination operand. 

■ All the remaining operations are simple unsigned subtraction with no saturation. 
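
Unsigned saturation reduces to "subtract, but never go below zero." The sketch below assumes a hypothetical helper named psubusb and a Python integer standing in for the MMX register. The fourth lane (9Ah - 98h) is an added sample value so that one lane does not saturate.

```python
def psubusb(dst: int, src: int) -> int:
    """Model of PSUBUSB: per-byte unsigned (dst - src), floored at 00h."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (dst >> shift) & 0xFF
        b = (src >> shift) & 0xFF
        diff = a - b if a > b else 0  # saturate to zero when src > dst
        result |= diff << shift
    return result

# Lanes 53h-ECh, 42h-C1h, 07h-F7h (all saturate) and 9Ah-98h (does not).
print(hex(psubusb(0x5342079A, 0xECC1F798)))  # 0x2
```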

Related Instructions 

See the PSUBB instruction. 
See the PSUBD instruction. 
See the PSUBW instruction. 
See the PSUBSB instruction. 
See the PSUBSW instruction. 
See the PSUBUSW instruction. 

PSUBUSW 

mnemonic                        opcode    description 
PSUBUSW mmreg1, mmreg2/mem64    0F D9h    Subtract unsigned packed 16-bit values and saturate 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PSUBUSW instruction subtracts four unsigned 16-bit values in the source 
operand (an MMX register or a 64-bit memory location) from the four corresponding 
unsigned 16-bit values in the destination operand (an MMX register). If any 16-bit 
source value is greater than its corresponding 16-bit destination value, the result 
saturates to 0000h. The four unsigned 16-bit results are stored in the MMX register 
specified as the destination operand. 

Functional Illustration of the PSUBUSW Instruction 

[Figure: mmreg2/mem64 is subtracted from mmreg1 and the result is stored in mmreg1. ■ indicates a saturated value.] 


The following list explains the functional illustration of the PSUBUSW instruction: 

■ The unsigned 16-bit value EC22h is subtracted from the unsigned 16-bit value 
5321h, and the result saturates to 0000h because the source operand is greater 
than the destination operand. 

■ The remaining operations are simple unsigned subtraction with no saturation. 
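
The same floor-at-zero rule applies per 16-bit lane. In this sketch the name psubusw is an assumption, and the low lane (0005h - 0003h) is a made-up non-saturating case added next to the 5321h - EC22h case from the list above.

```python
def psubusw(dst: int, src: int) -> int:
    """Model of PSUBUSW: per-word unsigned (dst - src), floored at 0000h."""
    lanes = []
    for lane in range(4):
        a = (dst >> (16 * lane)) & 0xFFFF
        b = (src >> (16 * lane)) & 0xFFFF
        lanes.append(max(a - b, 0))  # a negative difference saturates to 0
    # Reassemble the four lanes into one 64-bit value.
    return sum(value << (16 * lane) for lane, value in enumerate(lanes))

print(hex(psubusw(0x5321_0005, 0xEC22_0003)))  # 0x2
```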

Related Instructions 

See the PSUBB instruction. 
See the PSUBD instruction. 
See the PSUBW instruction. 
See the PSUBSB instruction. 
See the PSUBSW instruction. 
See the PSUBUSB instruction. 

PSUBW 

mnemonic                      opcode    description 
PSUBW mmreg1, mmreg2/mem64    0F F9h    Subtract unsigned packed 16-bit values with wraparound 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PSUBW instruction subtracts four unsigned 16-bit values in the source operand 
(an MMX register or a 64-bit memory location) from the four corresponding unsigned 
16-bit values in the destination operand (an MMX register). If the source operand is 
larger than the destination operand, the result wraps around. 

Functional Illustration of the PSUBW Instruction 

The following list explains the functional illustration of the PSUBW instruction: 

■ The unsigned 16-bit value EC22h is subtracted from the unsigned 16-bit value 
5321h, and the result wraps around to 66FFh. 

■ The remaining operations are simple unsigned subtraction with no wraparound. 
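
Wraparound here means arithmetic modulo 2^16 within each lane, with no borrow passed to the neighboring lane. A minimal sketch, assuming a hypothetical helper named psubw and a Python integer in place of the MMX register:

```python
def psubw(dst: int, src: int) -> int:
    """Model of PSUBW: per-word (dst - src) modulo 2**16."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (dst >> shift) & 0xFFFF
        b = (src >> shift) & 0xFFFF
        # Masking the difference discards the borrow, so lanes stay independent.
        result |= ((a - b) & 0xFFFF) << shift
    return result

# The lane from the list above: 5321h - EC22h.
print(hex(psubw(0x5321, 0xEC22)))  # 0x66ff
```

The value 66FFh is 5321h + 10000h - EC22h, which is the wrapped result quoted in the list.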



Related Instructions 



See the PSUBB instruction. 
See the PSUBD instruction. 
See the PSUBSB instruction. 
See the PSUBSW instruction. 
See the PSUBUSB instruction. 
See the PSUBUSW instruction. 

PUNPCKHBW 

mnemonic                          opcode    description 
PUNPCKHBW mmreg1, mmreg2/mem64    0F 68h    Unpack the high 32 bits of packed 8-bit values 

Privilege: none 
Registers Affected: MMX 
Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKHBW instruction unpacks and interleaves four 8-bit values from the high 
32 bits of the source operand (an MMX register or a 64-bit memory location) and four 
8-bit values from the high 32 bits of the destination operand (an MMX register). The 
8-bit values from the source operand become the high 8 bits of the 16-bit results, and 
the 8-bit values from the destination operand become the low 8 bits of the 16-bit 
results. The eight interleaved 8-bit values are stored in the MMX register specified as 
the destination operand. 

Functional Illustration of the PUNPCKHBW Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

[Figure: source mmreg2/mem64 (top), destination mmreg1 (center), source mmreg1 (bottom).] 


In the functional illustration of the PUNPCKHBW instruction, the 8-bit values from 
mmreg1 are stored in the low-order 8 bits of the 16-bit result. The mmreg2/mem64 
source operand is set to all zero bits so it can provide zero fill in the high-order 8 bits of 
the 16-bit result. This is a method that can be used to expand unsigned 8-bit values 
into unsigned 16-bit operands for subsequent processing that requires higher 
precision. 
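
The zero-fill technique described above can be sketched as follows. The helper name punpckhbw and the sample register contents are assumptions for illustration. A Python integer stands in for the 64-bit MMX register.

```python
def punpckhbw(dst: int, src: int) -> int:
    """Model of PUNPCKHBW: interleave the four high bytes of dst and src."""
    d_hi = (dst >> 32) & 0xFFFFFFFF
    s_hi = (src >> 32) & 0xFFFFFFFF
    result = 0
    for lane in range(4):
        d = (d_hi >> (8 * lane)) & 0xFF
        s = (s_hi >> (8 * lane)) & 0xFF
        # Source byte becomes the high half of the word, destination the low.
        result |= ((s << 8) | d) << (16 * lane)
    return result

# With an all-zero source, the high four bytes are zero-extended to words.
print(f"{punpckhbw(0x1122334455667788, 0):016X}")  # 0011002200330044
```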



Related Instructions 



See the PACKSSWB instruction. 
See the PACKUSWB instruction. 
See the PSRAW instruction. 
See the PUNPCKHDQ instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLBW instruction. 
See the PUNPCKLDQ instruction. 
See the PUNPCKLWD instruction. 

PUNPCKHDQ 

mnemonic                          opcode    description 
PUNPCKHDQ mmreg1, mmreg2/mem64    0F 6Ah    Unpack the high 32 bits of packed 32-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 
Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKHDQ instruction unpacks and interleaves the high 32 bits of the source 
operand (an MMX register or a 64-bit memory location) and the high 32 bits of the 
destination operand (an MMX register). The 32-bit value from the source operand 
becomes the high 32 bits of the 64-bit result, and the 32-bit value from the destination 
operand becomes the low 32 bits of the 64-bit result. The interleaved 32-bit values are 
stored in the MMX register specified as the destination operand. 

Functional Illustration of the PUNPCKHDQ Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

[Figure: source mmreg2/mem64 (top), destination mmreg1 (center), source mmreg1 (bottom).] 

In the functional illustration of the PUNPCKHDQ instruction, the 32-bit value from 
mmreg1 is stored in the low-order 32 bits of the 64-bit result. The mmreg2/mem64 
source operand is set to all zero bits so it can provide zero fill in the high-order 32 bits 
of the 64-bit result. This is a method that can be used to expand unsigned 32-bit values 
into unsigned 64-bit operands for subsequent processing that requires higher 
precision. 
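
For doublewords the interleave degenerates to picking one 32-bit half from each operand. The sketch below uses a hypothetical helper named punpckhdq. The register contents are sample values chosen for illustration.

```python
MASK32 = 0xFFFFFFFF

def punpckhdq(dst: int, src: int) -> int:
    """Model of PUNPCKHDQ: result = (high dword of src, high dword of dst)."""
    d_hi = (dst >> 32) & MASK32
    s_hi = (src >> 32) & MASK32
    return (s_hi << 32) | d_hi

# An all-zero source zero-extends the high doubleword of the destination.
print(f"{punpckhdq(0x888044A87F06FE80, 0):016X}")  # 00000000888044A8
```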



Related Instructions 



See the PUNPCKHBW instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLBW instruction. 
See the PUNPCKLDQ instruction. 
See the PUNPCKLWD instruction. 

PUNPCKHWD 

mnemonic                          opcode    description 
PUNPCKHWD mmreg1, mmreg2/mem64    0F 69h    Unpack the high 32 bits of packed 16-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 
Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKHWD instruction unpacks and interleaves two 16-bit values from the 
high 32 bits of the source operand (an MMX register or a 64-bit memory location) and 
two 16-bit values from the high 32 bits of the destination operand (an MMX register). 
The 16-bit values from the source operand become the high 16 bits of the 32-bit 
results, and the 16-bit values from the destination operand become the low 16 bits of 
the 32-bit results. The four interleaved 16-bit values are stored in the MMX register 
specified as the destination operand. 

Functional Illustration of the PUNPCKHWD Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

In the functional illustration of the PUNPCKHWD instruction, the 16-bit values from 
mmreg1 are stored in the low-order 16 bits of the 32-bit result. The 16-bit values from 
the mmreg2/mem64 source operand are stored in the high-order 16 bits of the 32-bit 
result. This is an example of the use of the PUNPCKHWD instruction to assemble 
32-bit operands from the high and low 16-bit results produced by the PMULHW and 
PMULLW instructions. In this example, the high and low 16-bit results are interleaved 
to produce the signed 32-bit results 1569_4030h and F98C_7662h. 
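
The assembly of 32-bit products from separate high and low halves can be sketched as below. The helper name punpckhwd, the variable names, and the placement of the words inside the two registers are assumptions. Only the two 32-bit results come from the text above.

```python
def punpckhwd(dst: int, src: int) -> int:
    """Model of PUNPCKHWD: interleave the two high words of dst and src."""
    d_hi = (dst >> 32) & 0xFFFFFFFF
    s_hi = (src >> 32) & 0xFFFFFFFF
    result = 0
    for lane in range(2):
        d = (d_hi >> (16 * lane)) & 0xFFFF
        s = (s_hi >> (16 * lane)) & 0xFFFF
        # Source word becomes the high half of the dword, destination the low.
        result |= ((s << 16) | d) << (32 * lane)
    return result

low_words = 0x4030766200000000   # hypothetical PMULLW output (low halves)
high_words = 0x1569F98C00000000  # hypothetical PMULHW output (high halves)
print(f"{punpckhwd(low_words, high_words):016X}")  # 15694030F98C7662
```

The destination holds the low halves and the source holds the high halves, so each 32-bit lane ends up as high:low.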

Related Instructions 

See the PACKSSDW instruction. 
See the PSRAD instruction. 
See the PMULHW instruction. 
See the PMULLW instruction. 
See the PUNPCKHBW instruction. 
See the PUNPCKHDQ instruction. 
See the PUNPCKLBW instruction. 
See the PUNPCKLDQ instruction. 
See the PUNPCKLWD instruction. 

PUNPCKLBW 

mnemonic                          opcode    description 
PUNPCKLBW mmreg1, mmreg2/mem64    0F 60h    Unpack the low 32 bits of packed 8-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKLBW instruction unpacks and interleaves four 8-bit values from the low 
32 bits of the source operand (an MMX register or a 64-bit memory location) and four 
8-bit values from the low 32 bits of the destination operand (an MMX register). The 
8-bit values from the source operand become the high 8 bits of the 16-bit results, and 
the 8-bit values from the destination operand become the low 8 bits of the 16-bit 
results. The eight interleaved 8-bit values are stored in the MMX register specified as 
the destination operand. 

Functional Illustration of the PUNPCKLBW Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

[Figure: source mmreg2/mem64 (top), destination mmreg1 (center), source mmreg1 (bottom).] 


In the functional illustration of the PUNPCKLBW instruction, the 8-bit values from 
mmreg1 are stored in the low-order 8 bits of the 16-bit result. The mmreg2/mem64 
source operand is set to all zero bits so it can provide zero fill in the high-order 8 bits of 
the 16-bit result. This is a method that can be used to expand unsigned 8-bit values 
into unsigned 16-bit operands for subsequent processing that requires higher 
precision. 
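
The sketch below shows both uses of the low-half byte interleave: zero extension with an all-zero source, and a general interleave of two byte streams. The helper name punpcklbw and all register contents are illustrative assumptions.

```python
def punpcklbw(dst: int, src: int) -> int:
    """Model of PUNPCKLBW: interleave the four low bytes of dst and src."""
    result = 0
    for lane in range(4):
        d = (dst >> (8 * lane)) & 0xFF
        s = (src >> (8 * lane)) & 0xFF
        # Source byte becomes the high half of the word, destination the low.
        result |= ((s << 8) | d) << (16 * lane)
    return result

pixels = 0x1122334455667788  # sample packed unsigned bytes
print(f"{punpcklbw(pixels, 0):016X}")           # 0055006600770088
print(f"{punpcklbw(pixels, 0xAABBCCDD):016X}")  # AA55BB66CC77DD88
```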



Related Instructions 

See the PACKSSWB instruction. 
See the PACKUSWB instruction. 
See the PSRAW instruction. 
See the PUNPCKHBW instruction. 
See the PUNPCKHDQ instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLDQ instruction. 
See the PUNPCKLWD instruction. 

PUNPCKLDQ 

mnemonic                          opcode    description 
PUNPCKLDQ mmreg1, mmreg2/mem64    0F 62h    Unpack the low 32 bits of packed 32-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKLDQ instruction unpacks and interleaves the low 32 bits of the source 
operand (an MMX register or a 64-bit memory location) and the low 32 bits of the 
destination operand (an MMX register). The 32-bit value from the source operand 
becomes the high 32 bits of the 64-bit result, and the 32-bit value from the destination 
operand becomes the low 32 bits of the 64-bit result. The interleaved 32-bit values are 
stored in the MMX register specified as the destination operand. 

Functional Illustration of the PUNPCKLDQ Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

In the functional illustration of the PUNPCKLDQ instruction, the 32-bit value from 
mmreg1 is stored in the low-order 32 bits of the 64-bit result. The mmreg2/mem64 
source operand is set to all zero bits so it can provide zero fill in the high-order 32 bits 
of the 64-bit result. This is a method that can be used to expand unsigned 32-bit values 
into unsigned 64-bit operands for subsequent processing that requires higher 
precision. 
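
A minimal model of the low-doubleword case follows, assuming a hypothetical helper named punpckldq and sample register contents. It shows that the result keeps the low doubleword of the destination and places the low doubleword of the source above it.

```python
MASK32 = 0xFFFFFFFF

def punpckldq(dst: int, src: int) -> int:
    """Model of PUNPCKLDQ: result = (low dword of src, low dword of dst)."""
    return ((src & MASK32) << 32) | (dst & MASK32)

# An all-zero source zero-extends the low doubleword of the destination.
print(f"{punpckldq(0x888044A87F06FE80, 0):016X}")  # 000000007F06FE80
```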

Related Instructions 

See the PUNPCKHBW instruction. 
See the PUNPCKHDQ instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLBW instruction. 
See the PUNPCKLWD instruction. 

PUNPCKLWD 

mnemonic                          opcode    description 
PUNPCKLWD mmreg1, mmreg2/mem64    0F 61h    Unpack the low 32 bits of packed 16-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PUNPCKLWD instruction unpacks and interleaves two 16-bit values from the low 
32 bits of the source operand (an MMX register or a 64-bit memory location) and two 
16-bit values from the low 32 bits of the destination operand (an MMX register). The 
16-bit values from the source operand become the high 16 bits of the 32-bit results, 
and the 16-bit values from the destination operand become the low 16 bits of the 32-bit 
results. The four interleaved 16-bit values are stored in the MMX register specified as 
the destination operand. 

Functional Illustration of the PUNPCKLWD Instruction 

In the following figure, the destination register is shown at the center to illustrate the 
flow of data from the two source operands. 

In the functional illustration of the PUNPCKLWD instruction, the 16-bit values from 
mmreg1 are stored in the low-order 16 bits of the 32-bit result. The 16-bit values from 
the mmreg2/mem64 source operand are stored in the high-order 16 bits of the 32-bit 
result. This is an example of the use of the PUNPCKLWD instruction to assemble 
32-bit operands from the high and low 16-bit results produced by the PMULHW and 
PMULLW instructions. In this example, the high and low 16-bit results are interleaved 
to produce the signed 32-bit results 06FD_5FCFh and 0000_0001h. 
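
The two results quoted above can be reproduced with a short model. The helper name punpcklwd and the way the words are laid out in the two input registers are assumptions made for illustration. Only the final 32-bit values are taken from the text.

```python
def punpcklwd(dst: int, src: int) -> int:
    """Model of PUNPCKLWD: interleave the two low words of dst and src."""
    result = 0
    for lane in range(2):
        d = (dst >> (16 * lane)) & 0xFFFF
        s = (src >> (16 * lane)) & 0xFFFF
        # Source word becomes the high half of the dword, destination the low.
        result |= ((s << 16) | d) << (32 * lane)
    return result

low_words = 0x5FCF0001   # hypothetical PMULLW output (low halves)
high_words = 0x06FD0000  # hypothetical PMULHW output (high halves)
print(f"{punpcklwd(low_words, high_words):016X}")  # 06FD5FCF00000001
```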



Related Instructions 



See the PACKSSDW instruction. 
See the PSRAD instruction. 
See the PMULHW instruction. 
See the PMULLW instruction. 
See the PUNPCKHBW instruction. 
See the PUNPCKHDQ instruction. 
See the PUNPCKHWD instruction. 
See the PUNPCKLBW instruction. 
See the PUNPCKLDQ instruction. 

PXOR 

mnemonic                     opcode    description 
PXOR mmreg1, mmreg2/mem64    0F EFh    XOR 64-bit values 

Privilege: none 

Registers Affected: MMX 

Flags Affected: none 

Exceptions Generated: 



Exception                              Real  Virtual 8086  Protected  Description
Invalid opcode (6)                     X     X             X          The emulate MMX instruction bit (EM) of the control register (CR0) is set to 1.
Device not available (7)               X     X             X          Save the floating-point or MMX state if the task switch bit (TS) of the control register (CR0) is set to 1.
Stack exception (12)                                       X          During instruction execution, the stack segment limit was exceeded.
General protection (13)                                    X          During instruction execution, the effective address of one of the segment registers used for the operand points to an illegal memory location.
Segment overrun (13)                   X     X                        One of the instruction data operands falls outside the address range 00000h to 0FFFFh.
Page fault (14)                              X             X          A page fault resulted from the execution of the instruction.
Floating-point exception pending (16)  X     X             X          An exception is pending due to the floating-point execution unit.
Alignment check (17)                         X             X          An unaligned memory reference resulted from the instruction execution, and the alignment mask bit (AM) of the control register (CR0) is set to 1. (In Protected Mode, CPL = 3.)



The PXOR instruction logically XORs the 64 bits of the source operand (an MMX 
register or a 64-bit memory location) with the 64 bits of the destination operand (an 
MMX register) and stores the result in the destination register. 

A logical XOR produces a 1 bit if only one of the two input bits is a 1. If both input bits 
are 0 or both input bits are 1, a logical XOR produces a 0 bit. 

Functional Illustration of the PXOR Instruction 

              63                48 47                32 31                16 15                 0
mmreg1        1010_1111_0000_1101  0000_1111_0000_1111  1100_0001_0011_0001  1000_1100_1101_0011
                  Logical XOR          Logical XOR          Logical XOR          Logical XOR
mmreg2/mem64  0101_1100_1100_0011  1100_1101_0100_1110  1011_0001_0011_1001  0110_0011_0101_1011
Result
mmreg1        1111_0011_1100_1110  1100_0010_0100_0001  0111_0000_0000_1000  1110_1111_1000_1000




In the functional illustration of the PXOR instruction, the 64-bit source value is 
logically XOR'd to the 64-bit destination value, and the result is stored in the 
destination register. 
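
As a rough illustration of the operation described above, the following Python sketch models PXOR on two 64-bit values. The function name `pxor` and the sample register values are assumptions chosen for this example; they do not come from the figure.

```python
MASK64 = (1 << 64) - 1  # an MMX register holds 64 bits


def pxor(dest: int, src: int) -> int:
    """Model PXOR: return the bitwise XOR of two 64-bit operands."""
    return (dest ^ src) & MASK64


# Hypothetical register contents, chosen only for illustration.
mm0 = 0xAAAA_5555_F0F0_0F0F
mm1 = 0xFFFF_0000_FFFF_0000

result = pxor(mm0, mm1)
cleared = pxor(mm0, mm0)  # XORing a value with itself clears every bit

print(f"{result:016X}")
print(f"{cleared:016X}")
```

XORing a register with itself, as in the last call, is a common way to zero an MMX register without loading a constant from memory.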

Related Instructions 

See the PAND instruction. 
See the PANDN instruction. 
See the POR instruction. 

Appendix B 



Code Optimization 



Introduction 



The AMD-K6 3D processor can efficiently execute code written 
for previous-generation x86 processors. However, to get the 
highest performance from the unique microarchitecture of the 
processor, certain code optimization techniques should be 
applied. 

This appendix contains information to assist programmers in 
creating optimized code for the processor. This information is 
targeted at compiler/assembler designers and assembly 
language programmers writing high-performance code 
sequences. It is assumed that the reader possesses an in-depth 
knowledge of the x86 architecture. 

The AMD-K6 Family of Processors 



Processors in the AMD-K6 family use a decoupled instruction 
decode and superscalar execution microarchitecture, including 
state-of-the-art RISC design techniques, to deliver 
sixth-generation performance with full x86 binary software 
compatibility. An x86 binary-compatible processor implements 
the industry-standard x86 instruction set by decoding and 
executing the x86 instruction set as its native mode of 
operation. Only this native mode permits delivery of maximum 
performance when running PC software. 

The AMD-K6 3D Processor 



The AMD-K6 3D processor brings superscalar RISC 
performance to desktop systems running industry-standard x86 
software. This processor implements advanced design 
techniques such as: 

■ Instruction pre-decoding 

■ Multiple x86 opcode decoding 

■ Single-cycle internal RISC operations 

■ Multiple parallel execution units 

■ Out-of-order execution 

■ Data-forwarding 

■ Register renaming 

■ Dynamic branch prediction 

The processor is capable of issuing, executing, and retiring 
multiple x86 instructions per cycle, resulting in superior 
scaleable performance. 

Although the processor is capable of extracting code 
parallelism out of off-the-shelf, commercially available x86 
software, specific code optimizations for the AMD-K6 3D 
processor can result in significantly higher delivered 
performance. This appendix describes the RISC86 
microarchitecture in the processor and makes 
recommendations for optimizing execution of x86 software on 
the processor. The coding techniques for achieving peak 
performance on the AMD-K6 3D processor include, but are not 
limited to, those recommended for the Pentium, Pentium II, and 
Pentium Pro processors. However, many of these optimizations 
are not necessary for the AMD-K6 3D processor to achieve 
maximum performance. For example, due to more flexible 
pipeline control in the microarchitecture, the AMD-K6 3D 
processor is less sensitive to instruction selection and the 
scheduling of code. This flexibility is one of the distinct 
advantages of the AMD-K6 3D processor microarchitecture. 

In addition to the ability to perform multimedia operations, the 
AMD-K6 3D processor includes the first implementation of the 
3D instruction set. 3D technology was created based on 
suggestions from leading graphics and software vendors. 
Utilizing a data format and Single Instruction Multiple Data 
(SIMD) operations based on the MMX instruction model, the 
processor can produce up to four 32-bit, single-precision 
floating-point results per clock cycle. 3D technology also 
includes new integer multimedia instructions, a new 
instruction to allow the prefetching of data under software 
control, and a faster enter/exit multimedia-state instruction. 
For more information, see Chapter 4, "3D Technology" on page 
81 and Appendix A, "MMX Multimedia Technology" on page 
347. 

The 3D units provide support for high-performance, 
floating-point vector operations, which can replace x87 
instructions and enhance the performance of 3D graphics and 
other floating-point-intensive applications. The complete 
multimedia processing unit in the processor combines existing 
MMX instructions with the new 3D instructions. The 3D 
instructions share the use of the MMX registers with the 
multimedia unit. By merging 3D instructions with MMX 
instructions, it now becomes possible to write x86 programs 
containing both integer and floating-point instructions without 
a performance penalty for intermixing MMX and x87 
floating-point instructions. All these improvements have been 
carefully designed to bring a better multimedia experience to 
mainstream PC users while maintaining backwards 
compatibility with all existing x86 software. 

Execution Units and Dependency Latencies 



The processor contains several specialized execution 
pipelines — store, load, register X, register Y, floating-point, and 
branch condition. Each pipeline operates independently and 
handles a specific subset of the RISC86 instruction set. The 
register X and register Y pipelines each contain integer, 
multimedia, and 3D execution resources, some of which are 
shared between the two. This section describes the operation of 
these units, their execution latencies, and how these latencies 
affect concurrent dependency chains. 

Note: The multimedia execution unit executes MMX instructions. 

A dependency occurs when data needed in one execution 
unit/resource is being processed in another unit/resource (or a 
different stage of the same unit/resource). Additional latencies 
can occur because the dependent execution unit must wait for 
the data from the supplying unit. Table 86 on page 465 provides 
a summary of the execution units, the operations performed 
within these units, the operation latency, and the operation 
throughput. 
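
To make the effect of dependency latencies concrete, the following Python sketch adds up latencies along dependency chains. The latency values are the ones quoted in the text of this appendix (one clock for an integer ALU operation, two clocks for a load that hits the data cache, two clocks for an MMX multiply, and two clocks for common x87 operations). The function names and the operation labels are assumptions made for this example, and the model ignores decode, issue, and resource constraints.

```python
# Execution latencies in clocks, as quoted in the text of this appendix.
LATENCY = {
    "alu": 1,      # integer ALU register operation
    "load": 2,     # load that hits the data cache
    "mmx_mul": 2,  # MMX multiply
    "fpu": 2,      # common x87 operations such as FADD, FSUB, FMUL
}


def chain_latency(ops):
    """Total latency of operations that each depend on the previous one."""
    return sum(LATENCY[op] for op in ops)


def overlapped_latency(chains):
    """Independent chains can overlap, so the longest chain dominates.

    This is an idealized bound that assumes no resource contention.
    """
    return max(chain_latency(chain) for chain in chains)


serial = chain_latency(["load", "alu", "alu"])
overlapped = overlapped_latency([["load", "alu"], ["mmx_mul", "mmx_mul"]])
print(serial, overlapped)
```

The second call shows why spreading work across independent chains helps: two chains that would take seven clocks back to back are bounded by the longer one when they overlap.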

Execution Unit Terminology 

Introduction 

The execution units operate with two different types of register 
values: operands and results. There are three types of 
operands and two types of results. 

Operands 

The three types of operands are as follows: 

■ Address register operands: used for address calculations of 
load and store operations 

■ Data register operands: used for register operations 

■ Store data register operands: used for memory stores 

Results 

The two types of results are as follows: 

■ Data register results: produced by load or register operations 

■ Address register results: produced by Lea or Push operations 

The following examples illustrate the operand and result 
definitions: 

Add AX, BX 

The Add operation has two data register operands (AX 
and BX) and one data register result (AX). 

Load BX, [SP+4*CX+8] 

The Load operation has two address register operands (SP 
and CX as base and index registers, respectively) and a 
data register result (BX). 

Store [SP+4*CX+8], AX 

The Store operation has a store data register operand (AX) 
and two address register operands (SP and CX as base and 
index registers, respectively). 

Lea SI, [SP+4*CX+8] 

The Lea operation (a type of store operation) has address 
register operands (SP and CX as base and index registers, 
respectively), and an address register result (SI). 


Six-Stage Pipeline 

To help visualize the operations within the AMD-K6 3D 
processor, Figure 115 illustrates the effective pipeline stages. 
This is a simplified illustration in that the processor contains 
multiple parallel pipelines (starting after common instruction 
fetch and x86 decode pipe stages), and these pipelines often 
execute operations out-of-order with respect to each other. This 
view of the processor execution pipeline illustrates the effect of 
execution latencies for various types of operations. 

For many instructions, the effective pipeline is seven stages. 
For register operations that do not require execution stage 2, 
the effective pipeline is six stages. 



[Figure: the pipeline stages in order are Instruction Fetch, 
x86->RISC86 Decode, RISC86 Op Issue, Operand Fetch, 
Execution Stage 1, Execution Stage 2*, and Commit.] 

Note: * Execution Stage 2 is optional 

Figure 115. Processor Pipeline 

Register Execution Units 

The register execution resources are attached to the register X 
unit execution pipeline and the register Y unit execution 
pipeline. Each register execution pipeline has dedicated 
resources that consist of an integer execution unit and a 
multimedia ALU execution unit. In addition, both pipelines can 
use shared execution units for 3D operations and MMX shift 
and multiply operations. Figure 6 on page 20 shows the details 
of the register X and Y execution pipelines. 

The register X integer ALU execution resource can execute all 
ALU operations including ALU, multiply, divide (signed and 
unsigned), shift, and rotate. Data register results are available 
after a minimum of one clock of execution latency. 

The dedicated integer execution unit contained within the 
register Y execution pipeline can execute the basic word and 
doubleword ALU operations (ADD, AND, CMP, OR, SUB and 
XOR), zero-extend, and sign-extend operations. Data register 
results are available after one clock. 

The register X and Y execution pipelines each contain a 
dedicated MMX execution unit that handles add/subtract, 
logical, and pack/unpack MMX instructions. The multimedia 
ALU units are symmetrical and can be used simultaneously. 
This means that the processor can execute two multimedia ALU 
operations each clock cycle. 

A number of execution resources are available to both the 
register X and Y execution pipelines. These shared resources 
include the MMX shifter, 3D ALU, and the combined MMX/3D 
multiplier. Figure 50 on page 91 shows which instruction types 
are associated with the various execution pipelines. 

Any combination of two operations that do not utilize the same 
shared execution resource can be issued and executed 
simultaneously. For example, the following pairs of register 
operations can execute together — MMX logical and 3D add, 3D 
add and 3D multiply, MMX multiply and 3D add, etc. If issued 
simultaneously, the following examples result in resource 
contentions and the stall of one RISC86 operation: MMX 
multiply and 3D multiply, two MMX multiplies, two 3D 
multiplies, two 3D adds, etc. 

Figure 116 shows the data flow architecture of the single-stage 
or double-stage integer execution unit pipeline. There are few 
operations (such as integer multiply) that require a second 
execution stage. The operation issue and operand fetch stage 
(execution stage 0) that precede this execution stage are not 
part of the execution pipeline. The data register result is 
produced near the end of the execution pipe stage. 

Figure 116. Register X and Y Execution Stages 

Load Unit 

The load unit is a two-stage pipelined design that performs data 
memory reads. It has a two-clock latency from the time it 
receives the address register operands until it produces a data 
register result on a data cache hit. A cache miss produces longer 
latencies. The load unit and the data cache support 
hit-under-miss operations where a load operation bypasses a 
previous load operation that is stalled waiting for a cache line 
refill. This unit uses two address register operands and a 
memory data value as inputs, and produces a data register 
result. 

Memory read data can come from either the data cache or from 
the store queue entry (for a recent store). If the data is 
forwarded from the store queue, there is zero additional 
execution latency, which means that a dependent load 
operation can complete its execution one clock after a store 
operation completes execution. 
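
The forwarding path described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the class and function names are hypothetical, the match is on exact addresses only, and the capacity of seven entries is taken from the store unit description in this appendix. What the real processor does when the queue is full or when accesses partially overlap is not modeled.

```python
from collections import deque

MASK64 = (1 << 64) - 1  # each store queue entry can hold up to 64 bits of data


class StoreQueue:
    """Toy store queue with store-to-load forwarding (hypothetical model)."""

    CAPACITY = 7  # "up to seven data results" per the store unit description

    def __init__(self):
        self._entries = deque()

    def push(self, address, data):
        """Record a pending store as an (address, data) pair."""
        if len(self._entries) >= self.CAPACITY:
            # Assumption: a real processor would stall rather than fail.
            raise RuntimeError("store queue full")
        self._entries.append((address, data & MASK64))

    def forward(self, address):
        """Return data from the most recent store to this address, if any."""
        for entry_address, data in reversed(self._entries):
            if entry_address == address:
                return data
        return None


def load(address, store_queue, data_cache):
    """Read from the store queue when a recent store matches, else the cache."""
    forwarded = store_queue.forward(address)
    if forwarded is not None:
        return forwarded, "store queue"
    return data_cache[address], "data cache"


queue = StoreQueue()
queue.push(0x100, 1)
queue.push(0x100, 2)  # a later store to the same address wins
print(load(0x100, queue, {0x100: 0}))
print(load(0x200, queue, {0x200: 9}))
```

Searching the queue from newest to oldest is the design choice that matters here: a load must see the value of the most recent store to its address, not an older one.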

Figure 117 shows the architecture of the two-stage load 
execution pipeline. The address register operands are received 
at the end of the operand fetch pipe stage, and the data register 
result is produced near the end of the second execution pipe 
stage. The operation issue and fetch stages that precede this 
execution stage are not shown. 



[Figure: address register operands (base and index) enter 
Execution Stage 1 (Address Calculation Stage). Memory data from 
the data cache or store queue enters Execution Stage 2 (Data 
Cache/Store Queue Lookup), which produces the data register 
result.] 

Figure 117. Load Execution Unit 

Store Unit 

The store execution unit is a two-stage pipelined design that 
performs data memory writes, and, in some cases, produces an 
address register result. For inputs, the store unit uses two 
address register operands and, during actual memory writes, a 
store data register operand. This unit also produces an address 
register result for some store unit operations. For most store 
operations, for example those that write data to memory, the 
store unit produces a physical memory address and the 
associated data bytes to be written. After execution completes, 
these results are entered in a new store queue entry. The store 
queue can hold up to seven data results, each of which can be 64 
bits. 

The store unit has a one-clock execution latency from the time 
it receives address register operands until the time it produces 
an address register result. The most common examples are the 
Load Effective Address (Lea) and Store and Update (Push) 
RISC86 operations, which are produced from the x86 LEA and 
PUSH instructions, respectively. Most store operations do not 
produce an address register result and only perform a memory 
write. The Push operation is unique because it produces an 
address register result and performs a memory write. 

The store unit has a one-clock execution latency from the time 
it receives address register operands until it enters the store 
memory address and data pair into the store queue. 

The store unit can have a three-clock latency from the time it 
receives address register operands and a store data register 
operand until it enters the memory address and data pair into 
the store queue. 

Note: Address register operands are required at the start of 
execution, but register store data is not required until the 
end of execution. 

Figure 118 on page 464 shows the architecture of the two-stage 
store execution pipeline. The operation issue and fetch stages 
that precede this execution stage are not part of the execution 
pipeline. The address register operands are received at the end 
of the operand fetch pipe stage, and the new store queue entry 
is created upon completion of the second execution pipe stage. 

[Figure: address register operands (base and index) enter 
Execution Stage 1 (Address Calculation Stage), which produces 
the address register result. The address and the store data 
register operand pass through Execution Stage 2, which creates 
the store queue entry.] 

Figure 118. Store Unit Execution Pipeline 

Branch Condition Unit 

The branch condition unit is separate from the branch 
prediction logic, which is utilized at x86 instruction decode 
time. This unit resolves conditional branches, such as Jcc and 
LOOP instructions, at a rate of up to one per clock cycle. This 
unit has a dedicated RISC86 issue bus from the scheduler. For 
more information, see "Branch-Prediction Logic" on page 21. 



Floating-Point Unit 

The floating-point unit (FPU) handles all register operations for 
x87 instructions. The execution unit is a single-stage design 
that takes data register operands as inputs and produces a data 
register result as an output. The most common floating-point 
instructions have a two-clock execution latency from the time 
the FPU receives data register operands until it produces a data 
register result. The FPU has its own RISC86 issue bus from the 
scheduler. For more information, see Chapter 9, "Floating-Point 
and Multimedia Execution Units" on page 253. 

Latencies and Throughput 

Table 86 summarizes the static latencies and throughput of 
each execution unit. Knowing instruction latencies is important 
when figuring critical instruction dependencies. 



Table 86. RISC86 Execution Latencies and Throughput 

Execution Unit: Register X Integer Unit 
  Operations: Integer ALU; Integer Multiply; Integer Shift 
  Latency: 1; 2-3 
  Throughput: 1; 2-3 

Execution Unit: Register X Multimedia Unit 
  Operations: MMX Add/Subtract; MMX Logical, Pack, Unpack 
  Latency: 1 
  Throughput: 1 

Execution Unit: Register Y Integer Unit 
  Operations: Integer ALU (16- and 32-bit operands) 

Execution Unit: Register Y Multimedia Unit 
  Operations: MMX Add/Subtract; MMX Logical, Pack, Unpack 

Execution Unit: Multimedia/3D Shared Execution Units (X and Y) 
  Operations: MMX Shifter; MMX/3D Multiply, Reciprocal, and 
    Reciprocal Square Root Iteration; 3D Add, Compare, Integer 
    Conversion, Reciprocal, and Reciprocal Square Root Table Lookup 
  Latency: 2; 2 

Execution Unit: Load 
  Operations: From Address Register Operands to Data Register 
    Result; Memory Read Data from Data Cache/Store Queue to Data 
    Register Result 
  Latency: 2; 0 

Execution Unit: Store 
  Operations: From Address Register Operands to Address Register 
    Result; From Store Data Register Operands to Store Queue Entry; 
    From Address Register Operands to Store Queue Entry 
  Latency: 1; 1; 3 

Execution Unit: Branch 
  Operations: Resolves Branch Conditions 
  Latency: 1 

Execution Unit: FPU 
  Operations: FADD, FSUB; FMUL 
  Latency: 2; 2 
  Throughput: 2; 2 

Note: 

No additional latency exists between execution of dependent operations. Bypassing of register results directly from producing execution 
units to the operand inputs of dependent units is fully supported. Similarly, forwarding of memory store values from the store queue 
to dependent load operations is supported. 







Resource Constraints 

To optimize code effectively, execution resource constraints 
must be considered. Due to a fixed number of execution units, 
even with up to six RISC86 operations per cycle, optimal 
execution parallelism should be carefully scheduled. 

For example, if an IMUL is decoded and issued to the X 
pipeline, for the next two to three cycles integer, MMX, and 3D 
RISC86 operations can only be issued to the Y pipeline. 
Another example is two ALU instructions that require the load 
unit. Only one load can occur each cycle, therefore, one 
instruction would stall for a cycle. 

Contention for execution resources can cause delays in the 
issuing and execution of instructions. In addition, stalls due to 
resource constraints can increase dependency latencies to 
cause or exacerbate stalls due to dependencies. In general, 
constraints that delay non-critical instructions do not impact 
performance because such stalls typically overlap with the 
execution of critical operations. 
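
The contention between multimedia and 3D register operations that this appendix describes for the shared execution units can be sketched as a lookup. In this Python sketch, the table `SHARED_RESOURCE`, the operation labels, and the function name are assumptions made for illustration. The mapping follows the shared resources named in the text (MMX shifter, 3D ALU, and combined MMX/3D multiplier), and any operation not listed is assumed to use dedicated resources that exist in both the X and Y pipelines.

```python
# Shared execution resource used by each operation type (hypothetical labels).
SHARED_RESOURCE = {
    "mmx_shift": "mmx_shifter",
    "mmx_mul": "mmx_3d_multiplier",
    "3d_mul": "mmx_3d_multiplier",
    "3d_add": "3d_alu",
}


def can_issue_together(op_a, op_b):
    """Return True if two register operations do not contend for the
    same shared execution resource.

    Operations missing from SHARED_RESOURCE (for example MMX logical
    or MMX add) are assumed to use dedicated per-pipeline units.
    """
    resource_a = SHARED_RESOURCE.get(op_a)
    resource_b = SHARED_RESOURCE.get(op_b)
    if resource_a is None or resource_b is None:
        return True
    return resource_a != resource_b


for pair in [("mmx_logical", "3d_add"), ("3d_add", "3d_mul"),
             ("mmx_mul", "3d_mul"), ("3d_add", "3d_add")]:
    print(pair, can_issue_together(*pair))
```

This model covers only the multimedia and 3D pairing rule. It does not capture other constraints mentioned in this appendix, such as shift and multiply operations being limited to the integer X unit.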

Code Sample Analysis 

The samples in this section show the execution behavior of 
several series of instructions as a function of decode 
constraints, dependencies, and execution resource constraints. 

Note: These samples are animated in the AMD-K6 3D simulator 
available on the CD-ROM that accompanies this book. 

The sample tables show the x86 instructions, the RISC86 
operation equivalents, the clock counts, and a description of the 
events occurring within the processor. 

The following nomenclature is used to describe the current 
location of a RISC86 operation (RISC86op): 

■ D - Decode stage 

■ Ix - Issue stage of register X unit 

■ Ox - Operand fetch stage of register X unit 

■ Ex1 - Execution stage 1 of register X unit 

■ Ex2 - Execution stage 2 of register X unit 

■ Iy - Issue stage of register Y unit 

■ Oy - Operand fetch stage of register Y unit 

■ Ey1 - Execution stage 1 of register Y unit 

■ Ey2 - Execution stage 2 of register Y unit 

■ Il - Issue stage of load unit 

■ Ol - Operand fetch stage of load unit 

■ El1 - Execution stage 1 of load unit 

■ El2 - Execution stage 2 of load unit 

■ Is - Issue stage of store unit 

■ Os - Operand fetch stage of store unit 

■ Es1 - Execution stage 1 of store unit 

■ Es2 - Execution stage 2 of store unit 

Note: Instructions execute more efficiently (that is, without 
delays) when scheduled apart by suitable distances based on 
dependencies. In general, the samples in this section show 
poorly scheduled code in order to illustrate the resultant 
effects. 

Table 87. Sample 1 - Integer Register Operations 

No.  Instruction             RISC86  Clocks
                             Opcode  1    2    3    4    5    6    7    8    9
1    IMUL EAX, EBX           alux    D    D    Ix   Ox   Ex1
                             alux                   Ix   Ox   Ex1
                             alux                        Ix   Ox   Ex1
2    INC ESI                 alu               D    Iy   Oy   Ey1
3    MOV EDI, 0x07F4         limm              D
4    SHL EAX, 8              alux                   D         Ix   Ox   Ex1
5    OR EAX, 0x0F            alu                    D    Iy   Oy   Ix   Ox   Ex1
6    ADD ESI, EDX            alu                         D    Iy   Oy   Ey1
7    SUB EDI, ECX            alu                         D         Iy   Oy   Ey1

Comments for Each Instruction Number 

1 It takes two decode cycles because IMUL is vector decoded. The IMUL instruction is executable only in 
the integer X unit. It is a non-pipelined 2-3 cycle latency register operation that is equivalent to three 
serially-dependent register operations (the results of the second and third operations are AX and DX, 
respectively). 

2 This simple alu operation ends up in the Y pipe. 

3 A load immediate (limm) RISC86 operation does not require execution. The result value is immediately 
available to dependent operations. 

4 Shift instructions are only executable in the integer X unit. Issue is delayed by preceding IMUL 
operations due to a resource constraint of the integer X unit. 

5 The register operation is bumped out of the integer Y unit in clock 6 because it must wait for more than 
one cycle for its dependencies to resolve. It is reissued in the next cycle to the integer X unit (just in time 
for availability of its operands). 

6 This add alu falls through to the integer Y unit right behind the first issuance of instruction #5 without 
delay (as a result of instruction #5 being bumped out of the way). 

7 The issuance of the subtract register operation is delayed in clock 6 due to the resource constraints of 
the integer Y unit. 

Table 88. Sample 2 - Integer Register and Memory Load Operations 

No.  Instruction             RISC86  Clocks
                             Opcode  1    2    3    4    5    6    7    8    9    10   11
1    DEC EDX                 alu     D    Ix   Ox   Ex1
2    MOV EDI, [ECX]          load    D    Il   Ol   El1  El2
3    SUB EAX, [EDX+20]       load         D    Il   Ol   El1  El2
                             alu               Ix   Ox   Ix   Ox   Ex1
4    SAR EAX, 5              alux         D         Ix   Ox   Ix   Ox   Ex1
5    ADD ECX, [EDI+4]        load              D    Il   Ol   El1  El2
                             alu                    Iy   Oy   Iy   Oy   Ey1
6    AND EBX, 0x1F           alu               D         Iy   Oy   Ey1
7    MOV ESI, [0x0F100]      load                   D    Il   Ol   El1  El2
8    OR ECX, [ESI+EAX*4+8]   load                   D         Il   Ol   Ol   El1  El2
                             alu                                   Ix   Ox   Ix   Ox   Ex1

Comments for Each Instruction Number 

1 This simple alu operation ends up in the X pipe. 

2 This operation occupies the load execution unit. 

3 The register operand for the load operation is bypassed, without delay, from the result of instruction #1's 
register operand. In clock 4, the register operation is bumped out of the integer X unit while waiting for the 
previous load operation result to complete. It is reissued just in time to receive the bypassed result of the load. 

4 Shift instructions are only executable in the integer X unit. The register operation is bumped in clock 5 while 
waiting for the result of the preceding instruction #3. 

5 The register operand for the load operation is bypassed, without delay, from the result of instruction #2's 
register operand. This and most surrounding load operations are generated by instruction decoders, and issued 
and smoothly executed by the load unit at a rate of one per clock cycle. In clock 5, the register operation is 
bumped out of the integer Y unit while waiting for the previous load operation result to complete. 

6 The register operation falls through into the integer Y unit right behind instruction #5's register operation. 

7 This operation falls into the load unit behind the load in instruction #5. 

8 The operand fetch for the load operation is delayed because it needs the result of the immediately preceding 
load operation #7 as well as the results from earlier instructions #3 and #4. 

Table 89. Sample 3 - Integer Register and Memory Load/Store Operations 

No.  Instruction             RISC86  Clocks
                             Opcode  1    2    3    4    5    6    7    8    9    10   11
1    MOV EDX, [0xA0008F00]   load    D    Il   Ol   El1  El2
2    ADD [EDX+16], 7         load         D    Il   Ol   Ol   El1  El2
                             alu               Ix   Ox   Ix   Ox   Ox   Ex1
                             store             Is   Os   Os   Es1  Es2  Es2
3    SUB EAX, [EDX+16]       load              D    Il   Il   Ol   El1  El2  El2
                             alu                    Ix   Ox   Ix   Ix   Ox   Ox   Ex1
4    PUSH EAX                store             D    Is   Is   Os   Es1  Es2  Es2  Es2
5    LEA EBX, [ECX+EAX*4+3]  store                  D         Is   Os   Os   Os   Es1  Es2
6    MOV EDI, EBX            alu                    D    Iy   Oy   Iy   Oy   Ix   Ox   Ex1

Comments for Each Instruction Number 

1 This operation occupies the load unit. 

2 This long-decoded ADD instruction takes a single clock to decode. The operand fetch for the load operation is 
delayed waiting for the result of the previous load operation from instruction #1. The store operation 
completes concurrent with the register operation. The result of the register operation is bypassed directly into a 
new store queue entry created by the store operation. 

3 The issue of the load operation is delayed because the operand fetch of the preceding load operation from 
instruction #2 was delayed. The completion of the load operation is held up due to a memory dependency on 
the preceding store operation of instruction #2. The load operation completes immediately after the store 
operation, with the store data being forwarded from a new store queue entry. 

4 Completion of the store operation is held up due to a data dependency on the preceding instruction #3. The 
store data is bypassed directly into a new store queue entry from the result of instruction #3's register 
operation. 

5 The Lea RISC86 operation is executed by the store unit. The operand fetch is delayed waiting for the result of 
instruction #3. The register result value is produced in the first execution stage of the store unit. 

6 This simple alu operation is stalled due to the dependency on the BX result in instruction #5. 

Table 90. Sample 4 - Integer, MMX and Memory Load/Store Operations 

No.  Instruction             RISC86  Clocks
                             Opcode  1    2    3    4    5    6    7    8    9    10   11   12
1    PADDSW MM0, MM4         alu     D    Ix   Ox   Ex1
2    PADDSW MM1, MM5         alu     D    Iy   Oy   Ey1
3    PSRAW MM0, 3            alu          D    Ix   Ox   Ex1
4    MOVQ MM2, [EAX+EBX]     mload        D    Il   Ol   El1  El2
5    PAND MM0, MM3           alu               D    Ix   Ox   Ex1
6    PMULLW MM2, [EDI+8]     mload             D    Il   Ol   El1  El2
                             alu                    Iy   Oy   Ix   Ox   Ex1  Ex2
7    MOVQ [ESP+4], MM2       mstore                 D    Is   Os   Es1  Es2  Es2
8    ADD EBX, ECX            alu                    D    Ix   Ox   Ex1
9    PMULLW MM6, MM7         alu                         D    Iy   Oy   Ey1  Ey1  Ey2
10   PMADDWD MM2, MM6        alu                         D         Ix   Ox   Ox   Ox   Ex1  Ex2

Comments for Each Instruction Number 

1, 2 Instructions 1 and 2 are decoded, issued, and executed simultaneously and in parallel due to no decode 
restrictions, dependency delays, or execution resource constraints. 

3 This instruction is decoded, issued, and executed without delay, one cycle behind the preceding one-cycle 
execution latency instruction on which it is dependent. 

4 This multimedia operation occupies the load unit. 

5 This instruction is decoded, issued, and executed without delay, right behind the preceding operations on 
which it is dependent. 

6 This and the preceding instruction are decoded and issued together without delay. The operand fetch of the 
register operation is delayed because of the dependency on the associated load. As a result, the register 
operation is bumped out of register unit Y in clock 5 and is reissued in the next cycle to register unit X (as it 
happens), just in time for availability of its operands. 

7 Completion of this store operation is held up due to a data dependency on the preceding MMX multiply 
register operation (which has a two-cycle execution latency). The store data is bypassed directly into a new 
store queue entry from the result of the register operation. 

8 This operation is issued to register unit X and executes without delay and out-of-order with respect to the 
preceding register operation from instruction #6 (which was bumped out of the way while waiting for its 
operands). 

9 This MMX multiply register operation issues to and starts execution in register unit Y in parallel with an MMX 
multiply register operation from instruction #6, which simultaneously issues to and starts execution in register 
unit X. Due to an execution resource constraint, this operation is delayed one cycle in its first execution pipe 
stage and then executes and completes normally, one cycle behind the other contending register operation. 
(This takes advantage of the pipelined nature of the MMX multiply execution logic.) 

10 The issue of this operation is delayed (in clock 6) for one cycle due to two earlier register operations being 
selected for issue. It is then delayed further during operand fetch while waiting for the preceding two-cycle 
latency MMX multiply register operations to complete execution. 

Optimization Coding Guidelines 

General x86 Optimization Techniques 

This section describes general code optimization techniques 
specific to superscalar processors (that is, techniques common 
to the AMD-K6 3D processor, AMD-K5 processor, and 
Pentium-family processors). In general, all optimization 
techniques used for the AMD-K5 processor, Pentium, and 
Pentium Pro processors either improve the performance of the 
AMD-K6 3D processor or are not required and have a neutral 
effect (usually due to fewer coding restrictions with the 
AMD-K6 3D processor). 

Short Forms — Use shorter forms of instructions to increase the 
effective number of instructions that can be examined for 
decoding at any one time. Use 8-bit displacements and jump 
offsets where possible. 

Simple Instructions — Use simple instructions with hardwired 
decode (pairable, short, or fast) because they perform more 
efficiently. This includes "register <- register op memory" as 
well as "register <- register op register" forms of instructions. 

Dependencies — Spread out true dependencies to increase the 
opportunities for parallel execution. Anti-dependencies and 
output dependencies do not impact performance. 

Memory Operands — Instructions that operate on data in 
memory (load/operation/store) can inhibit parallelism. The use 
of separate move and ALU instructions allows better code 
scheduling for independent operations. However, if there are no 
opportunities for parallel execution, use the 
load/operation/store forms to reduce the number of register 
spills (storing values in memory to free registers for other uses). 

Register Operands — Maintain frequently used values in 
registers rather than in memory. 

Stack References — Use ESP for stack references so that EBP 
remains available. 

Stack Allocation — When allocating space for local variables 
and/or outgoing parameters within a procedure, adjust the 
stack pointer and use moves rather than pushes. This method of 
allocation allows random access to the outgoing parameters so 
that they can be set up when they are calculated instead of 
being held somewhere else until the procedure call. This 
method also reduces ESP dependencies and uses fewer 
execution resources. 

Data Embedding — When data is embedded in the code 
segment, align it in separate cache blocks from nearby code. 
This technique avoids some overhead when maintaining 
coherency between the instruction and data caches. 

Loops — Unroll loops to get more parallelism and reduce loop 
overhead, even with branch prediction. Inline small routines to 
avoid procedure-call overhead. For both techniques, however, 
consider the cost of possible increased register usage, which 
might add load/store instructions for register spilling. Unrolling 
large code loops can result in the inefficient use of L1 
instruction caches. 
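
The trade-off described under Loops can be sketched in C. The function names below are hypothetical and the sketch is only illustrative: it shows a loop unrolled by four with independent partial sums, which spreads out the true dependency on the accumulator at the cost of more live registers and more code bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one add and one loop-control branch per element,
 * with every add depending on the previous one. */
uint32_t sum_simple(const uint32_t *a, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by four with four independent partial sums. The adds inside
 * one iteration do not depend on each other, and the loop overhead is
 * paid once per four elements. The price is four accumulators (more
 * register pressure) and a larger loop body. */
uint32_t sum_unrolled4(const uint32_t *a, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Whether the unrolled form is actually faster depends on the compiler's register allocation and on the size of the loop body relative to the instruction cache, so it should be measured rather than assumed.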

Code Alignment — Aligning subroutines at 0-mod-16 (or ideally, 
at 0-mod-32) address boundaries optimizes instruction 
cache-fill efficiency. Keeping the starting point of loops at least 
two instructions away from the end of 32-byte cache lines 
optimizes branch-target instruction fetch and decode 
efficiency. 

General AMD-K6 3D Processor x86 Coding Optimizations 

This section describes general code optimization techniques 
specific to the AMD-K6 processor models 6, 7, 8, and 9. 

Use short-decodeable instructions — To increase decode 
bandwidth and minimize the number of RISC86 operations per 
x86 instruction, use short-decodeable x86 instructions. See 
instructions labeled as 'short' in the 'Decode Type' column in 
Tables 12 through 15 starting on page 53. 

Pair short-decodeable instructions — Two short-decodeable x86 
instructions can be decoded per clock, using the full decode 
bandwidth of the processor. 

Note: For the AMD-K6 3D processor, all MMX and 3D instructions 
are short-decodeable except the EMMS, FEMMS, and 
PREFETCH instructions. 

Avoid using complex instructions — The more complex and 
uncommon instructions are vector decoded and can generate a 
larger ratio of RISC86 operations per x86 instruction compared 
with short-decodeable or long-decodeable instructions. 

Avoid multiple and accumulated prefixes — In order to 
accomplish an instruction decode, the decoders require 
sufficient predecode information. When an instruction has 
multiple prefixes and this cannot be deduced by the decoders 
(due to a lack of data in the instruction decode buffer), the first 
decoder retires and accumulates one prefix per cycle until the 
instruction is completely decoded. Table 91 on page 475 shows 
when prefixes are accumulated and decoding is serialized. 

Table 91. Decode Accumulation and Serialization 

Decoder #1    Decoder #2              Results 
------------  ----------------------  ---------------------------------------------- 
Instruction                           Single instruction decoded. 
Instruction   Instruction             Dual instruction decode. 
Instruction   Prefix                  Single instruction decode and prefix is 
                                      accumulated. 
Prefix        Instruction             No prefix accumulation and single instruction 
              (modified by Prefix)    is decoded. 
PrefixA       PrefixB                 Accumulate PrefixA and cancel decode of the 
                                      second prefix. 
PrefixB       Instruction             If a prefix has already been accumulated in 
                                      the previous decode cycle, accumulate PrefixB 
                                      and cancel instruction decode, wait for next 
                                      decode cycle to decode the instruction. 

0Fh prefix usage — 0Fh does not count as a prefix for the 
decoder accumulation rules (that is, it does not cause 
accumulation). 



Avoid long instruction length — Use x86 instructions that are 
less than eight bytes in length. An x86 instruction that is longer 
than seven bytes cannot be short-decoded. 

Use read-modify-write instructions over discrete equivalent— 
No advantage is gained by splitting read-modify-write 
instructions into a load-execute-store instruction group. Both 
read-modify-write instructions and load-execute-store 
instruction groups decode and execute in one cycle but 
read-modify-write instructions promote better code density. 

Move rarely used code and data to separate pages — Placing 
code, such as exception handlers, in separate pages and data, 
such as error text messages, in separate pages maximizes the 
use of the TLBs and prevents table pollution with rarely used 
items. 

Avoid mixing code size types — Size prefixes that affect the 
length of an instruction can sometimes inhibit dual decoding. 

Always pair CALL and RETURN — If CALLs and RETs are not 
paired, the return address stack gets out of synchronization, 
increasing the latency of returns and decreasing performance. 

Exploit parallel execution of integer and floating-point 
multiplies — The AMD-K6 3D processor allows simultaneous 
integer and floating-point multiplies using separate, 
low-latency multipliers. 

Avoid more than 16 levels of nesting in subroutines — More than 
16 levels of nested subroutine calls overflow the return address 
stack, leading to lower performance. While this is not a problem 
for most code, recursive subroutines might easily exceed 16 
levels of subroutine calls. If the recursive subroutine is tail 
recursive, it can usually be mechanically transformed into an 
iterative version, which leads to increased performance. 
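
As an illustration of the mechanical transformation mentioned above, the following C sketch shows a tail-recursive routine next to its iterative equivalent. The function names are hypothetical, and summing 1..n is chosen only because it recurses far deeper than 16 levels for modest n.

```c
#include <stdint.h>

/* Tail-recursive form: the recursive call is the last action, so nothing
 * remains to be done after it returns. Unless the compiler removes the
 * call, summing 1..n nests n levels deep, which exceeds a 16-entry
 * return address stack as soon as n > 16. */
uint32_t sum_to_recursive(uint32_t n, uint32_t acc)
{
    if (n == 0)
        return acc;
    return sum_to_recursive(n - 1, acc + n);
}

/* Iterative form obtained mechanically: the parameters become loop
 * variables and the tail call becomes the next loop iteration. No CALL
 * or RET is executed inside the loop. */
uint32_t sum_to_iterative(uint32_t n)
{
    uint32_t acc = 0;
    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}
```

For example, `sum_to_recursive(100, 0)` and `sum_to_iterative(100)` both return 5050, but only the first can nest 100 calls deep.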

Place frequently used stack data within 128 bytes of the EBP — 
The statically most-referenced data items in a function's stack 
frame should be located from -128 to +127 bytes from EBP. This 
technique improves code density by enabling the use of an 8-bit 
sign-extended displacement instead of a 32-bit displacement. 

Avoid superset dependencies — Using the larger form of a 
register immediate after an instruction uses the smaller form 
creates a superset dependency and prevents parallel execution. 
For example, avoid the following type of code: 

    OR  AH, 07h 
    ADD EAX, 1555555h 

One method for avoiding superset dependencies is to schedule 
the instruction with the superset dependency (for example, the 
ADD instruction) 4-6 instructions later than would normally be 
preferable. Another method, useful in some cases, is to use the 
MOVZX instruction to efficiently convert a byte-size value to a 
doubleword-size value, which can then be combined with other 
values in 32-bit operations. 

Avoid excessive loop unrolling or code inlining — Excessive loop 
unrolling or code inlining increases code size and reduces 
locality, which leads to lower cache hit rates and reduced 
performance. 

Avoid splitting a 16-bit memory access in 32-bit code — No 
advantage is gained by splitting a 16-bit memory access in 
32-bit code into two byte-sized accesses. This technique avoids 
the operand size override. 

Avoid data dependent branches around a single instruction — 
Data dependent branches acting upon basically random data 
cause the branch prediction logic to mispredict the branch 
about 50% of the time. Design branch-free alternative code 
sequences. The effect is shorter average execution time. The 
following examples illustrate this concept: 

■ Signed integer ABS function (x = labs(x)) 
  Static Latency: 4 cycles 

    MOV ECX, [x]     ;load value 
    MOV EBX, ECX 
    SAR ECX, 31 
    XOR EBX, ECX     ;1's complement if x<0, else don't modify 
    SUB EBX, ECX     ;2's complement if x<0, else don't modify 
    MOV [x], EBX     ;save labs result 

■ Unsigned integer min function (z = x < y ? x : y) 
  Static Latency: 4 cycles 

    MOV EAX, [x]     ;load x value 
    MOV EBX, [y]     ;load y value 
    SUB EAX, EBX     ;set carry flag if y is greater than x 
    SBB ECX, ECX     ;get borrow out from previous SUB 
    AND ECX, EAX     ;if x < y, ECX = x-y, else 0 
    ADD ECX, EBX     ;if x < y, return x-y+y = x, else y 
    MOV [z], ECX     ;save min(x,y) 

■ Hexadecimal to ASCII conversion 
  (y = x < 10 ? x + 0x30 : x - 10 + 0x41) 
  Static Latency: 4 cycles 

    MOV AL, [x]      ;load x value 
    CMP AL, 10       ;if x is less than 10, set carry flag 
    SBB AL, 69h      ;0..9 -> 96h..9Fh, Ah..Fh -> A1h..A6h 
    DAS              ;0..9: subtract 66h, Ah..Fh: subtract 60h 
    MOV [y], AL      ;save conversion in y 
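
The same three branch-free ideas can be sketched in C. The function names are hypothetical. The first two functions mirror the mask arithmetic of the assembly sequences above. The hex conversion does not model DAS (C has no equivalent), so it uses a plain 0-or-1 flag instead and is a swapped-in technique, not a translation.

```c
#include <stdint.h>

/* Branch-free absolute value. mask is 0 when x >= 0 and 0xFFFFFFFF when
 * x < 0 (the role of SAR ECX, 31). XOR then SUB with the mask negates x
 * only when it was negative. The result is returned as unsigned so that
 * the most negative input is still representable. */
uint32_t abs_branchless(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t mask = (uint32_t)0 - (ux >> 31);
    return (ux ^ mask) - mask;
}

/* Branch-free unsigned minimum. mask is all ones when x < y (the borrow
 * captured by SBB ECX, ECX), so the result is y + (x - y) = x in that
 * case and y otherwise. */
uint32_t min_branchless(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < y);
    return y + (diff & mask);
}

/* Branch-free hex digit for x in 0..15. Letters need an extra offset of
 * 'A' - '0' - 10 = 7, added only when x > 9. */
char hex_digit_branchless(uint8_t x)
{
    uint32_t is_letter = (uint32_t)(x > 9);   /* 0 or 1 */
    return (char)('0' + x + is_letter * 7u);
}
```

A compiler is free to turn the comparisons back into branches or into conditional moves, so the generated code should be inspected if the branch-free property matters.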

Avoid using the [ESI] addressing mode — This addressing mode 
forces the instructions using it to become vector decoded. There 
are two ways to avoid this problem. The first way is to use 
another register. The second way is to alter the addressing mode 
by explicitly coding [ESI+0]. Be aware that assemblers may 
optimize this back to [ESI] by removing the 0. 

AMD-K6 3D Processor Integer x86 Coding Optimizations 

This section describes integer code optimization techniques 
specific to the AMD-K6 processor models 6, 7, 8, and 9. 

Neutral code filler — Use the XCHG EAX, EAX or NOP 
instruction when aligning instructions. XCHG EAX, EAX 
consumes one decode slot but requires no execution resources. 
Essentially, the scheduler absorbs the equivalent RISC86 
operation without requiring any of the execution units. 

Inline REP String with low counts— Expand REP String 
instructions into equivalent sequences of simple x86 
instructions. This technique eliminates the setup overhead of 
these instructions and increases instruction throughput. 

Use ADD reg, reg instead of SHL reg, 1 — This optimization 
technique allows the scheduler to use either of the two integer 
adders rather than the single shifter and effectively increases 
overall throughput. The only difference between these two 
instructions is the setting of the AF flag. 

Use MOVZX and MOVSX to zero-extend and sign-extend 
byte-size and word-size operands to doubleword length — For 
example, typical code for zero extension creates a superset 
dependency when the zero-extended value is used, as in the 
following code: 

    XOR EAX, EAX 
    MOV AL, [mem] 

Instead, use the following code: 

    MOVZX EAX, BYTE PTR [mem] 

Use load-execute integer instructions — Most load-execute 
integer instructions are short-decodeable and can be decoded 
at the rate of two per cycle. Splitting a load-execute instruction 
into two separate instructions — a load instruction and a reg, reg 
instruction — reduces decoding bandwidth and increases 
register pressure. The split-instruction form can be used to 
avoid scheduler stalls for longer executing instructions and to 
explicitly schedule the load and execute operations. 

Use AL, AX, and EAX to improve code density — In many cases, 
instructions using AL and EAX can be encoded in one less byte 
than using a general-purpose register. For example, ADD AX, 
0x5555 should be encoded 05 55 55 and not 81 C0 55 55. 

Clear registers using MOV reg, 0 instead of XOR reg, reg — 
Executing XOR reg, reg requires additional overhead due to 
register dependency checking and flag generation. Using MOV 
reg, 0 produces a limm (load immediate) RISC86 operation that 
is completed when placed in the scheduler and does not 
consume execution resources. 

Use 8-bit sign-extended immediates — Using 8-bit 
sign-extended immediates improves code density with no 
negative effects on the processor. For example, ADD BX, -5 
should be encoded 83 C3 FB and not 81 C3 FF FB. 

Use 8-bit sign-extended displacements for conditional 
branches — Using short, 8-bit sign-extended displacements for 
conditional branches improves code density with no negative 
effects on the processor. 

Use integer multiply over shift-add sequences when it is 
advantageous — The AMD-K6 3D processor features a 
low-latency integer multiplier. Therefore, almost any shift-add 
sequences can have higher latency than MUL or IMUL 
instructions. An exception is a trivial case involving 
multiplication by powers of two by means of left shifts. In 
general, replacements should be made if the shift-add 
sequences have a latency greater than or equal to 3 clocks. 

Carefully choose the best method for pushing memory data — 
To reduce register pressure and code dependency, use PUSH 
[mem] rather than MOV EAX, [mem], PUSH EAX. 

Balance the use of CWD, CBW, CDQ, and CWDE— These 
instructions require special attention to avoid either decreased 
decode or execution bandwidth. The following code illustrates 
the possible trade-offs: 

■ The following code replacement trades decode bandwidth 
(CWD is vector decoded, but with only one RISC86 
operation) with execution bandwidth (SAR requires two 
RISC86 operations, including a shift): 

    Replace: CWD        With: MOV DX, AX 
                              SAR DX, 15 

■ The following code replacement improves decode 
bandwidth (CBW is vector decoded while MOVSX is short 
decoded): 

    Replace: CBW        With: MOVSX AX, AL 

■ The following code replacement trades decode bandwidth 
(CDQ is vector decoded, but with only two RISC86 
operations) with execution bandwidth (SAR requires two 
RISC86 operations, including a shift): 

    Replace: CDQ        With: MOV EDX, EAX 
                              SAR EDX, 31 

■ The following code replacement improves decode 
bandwidth (CWDE is vector decoded while MOVSX is short 
decoded): 

    Replace: CWDE       With: MOVSX EAX, AX 

Replace integer division by constants with multiplication by 
the reciprocal — This optimization is commonly used on RISC 
processors. Because the AMD-K6 3D processor has an extremely 
fast integer multiply (two cycles) and the integer division 
delivers only two bits of quotient per cycle (approximately 18 
cycles for 32-bit divides), the equivalent code is much faster. 
The following examples illustrate the use of integer division by 
constants: 

■ Unsigned division by 10 using multiplication by reciprocal 
  Static Latency: 5 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EDX = quotient 

    MOV EDX, 0CCCCCCCDh  ;0.1 * 2^32 * 8 rounded up 
    MUL EDX 
    SHR EDX, 3           ;divide by 2^32 * 8 

■ Unsigned division by 3 using multiplication by reciprocal 
  Static Latency: 5 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EDX = quotient 

    MOV EDX, 0AAAAAAABh  ;1/3 * 2^32 * 2 rounded up 
    MUL EDX 
    SHR EDX, 1           ;divide by 2^32 * 2 

■ Signed division by 2 
  Static Latency: 3 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = quotient 

    CMP EAX, 80000000h   ;CY = 1, if dividend >= 0 
    SBB EAX, -1          ;increment dividend if it is < 0 
    SAR EAX, 1           ;perform a right shift 

■ Signed division by 2^n 
  Static Latency: 5 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = quotient 

    MOV EDX, EAX         ;sign extend into EDX 
    SAR EDX, 31          ;EDX = 0xFFFFFFFF if dividend < 0 
    AND EDX, (2^n-1)     ;mask correction (use divisor -1) 
    ADD EAX, EDX         ;apply correction if necessary 
    SAR EAX, (n)         ;perform right shift by log2 (divisor) 

■ Signed division by -2 
  Static Latency: 4 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = quotient 

    CMP EAX, 80000000h   ;CY = 1, if dividend >= 0 
    SBB EAX, -1          ;increment dividend if it is < 0 
    SAR EAX, 1           ;perform right shift 
    NEG EAX              ;use (x/-2) == -(x/2) 

■ Signed division by -(2^n) 
  Static Latency: 6 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = quotient 

    MOV EDX, EAX         ;sign extend into EDX 
    SAR EDX, 31          ;EDX = 0xFFFFFFFF if dividend < 0 
    AND EDX, (2^n-1)     ;mask correction (-divisor -1) 
    ADD EAX, EDX         ;apply correction if necessary 
    SAR EAX, (n)         ;right shift by log2(-divisor) 
    NEG EAX              ;use (x/-(2^n)) == (-(x/2^n)) 

■ Remainder of signed integer 2 or (-2) 
  Static Latency: 4 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = remainder 

    MOV EDX, EAX         ;sign extend into EDX 
    SAR EDX, 31          ;EDX = 0xFFFFFFFF if dividend < 0 
    AND EAX, 1           ;compute remainder 
    XOR EAX, EDX         ;negate remainder if 
    SUB EAX, EDX         ;dividend was < 0 
    MOV [remainder], EAX 

■ Remainder of signed integer (2^n) or (-(2^n)) 
  Static Latency: 6 cycles 

    ; IN:  EAX = dividend 
    ; OUT: EAX = remainder 

    MOV EDX, EAX         ;sign extend into EDX 
    SAR EDX, 31          ;EDX = 0xFFFFFFFF if dividend < 0 
    AND EDX, (2^n-1)     ;mask correction (abs(divisor)-1) 
    ADD EAX, EDX         ;apply pre-correction 
    AND EAX, (2^n-1)     ;mask out remainder (abs(divisor)-1) 
    SUB EAX, EDX         ;apply pre-correction if necessary 
    MOV [remainder], EAX 
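
The arithmetic behind these sequences can be checked with a short C sketch. The function names are hypothetical. The unsigned routines use a 64-bit product in place of the EDX:EAX result of MUL, and the signed routines assume that `>>` on a negative `int32_t` is an arithmetic shift, which the C standard leaves implementation-defined but which holds on mainstream x86 compilers.

```c
#include <stdint.h>

/* x / 10: 0xCCCCCCCD is 2^35 / 10 rounded up. Taking the upper 32 bits
 * of the product (>> 32) and then shifting by 3 is one shift by 35. */
uint32_t div10_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* x / 3: 0xAAAAAAAB is 2^33 / 3 rounded up. */
uint32_t div3_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* 0 for x >= 0, -1 (all ones) for x < 0: the role of SAR EDX, 31. */
static int32_t sign_mask(int32_t x)
{
    return -(int32_t)((uint32_t)x >> 31);
}

/* x / 2^n with truncation toward zero, for 1 <= n <= 30. Negative
 * dividends get divisor - 1 added before the shift so that the shift
 * rounds toward zero instead of toward minus infinity. */
int32_t sdiv_pow2(int32_t x, unsigned n)
{
    int32_t corr = sign_mask(x) & (int32_t)((1u << n) - 1u);
    return (x + corr) >> n;   /* assumes arithmetic right shift */
}

/* x % 2^n with the sign of the dividend, for 1 <= n <= 30. The same
 * pre-correction is applied, the low n bits are masked out, and the
 * correction is subtracted again. */
int32_t srem_pow2(int32_t x, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);
    int32_t corr = sign_mask(x) & mask;
    return ((x + corr) & mask) - corr;
}
```

For example, `sdiv_pow2(-7, 2)` yields -1 and `srem_pow2(-7, 2)` yields -3, matching C's `-7 / 4` and `-7 % 4`.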

AMD-K6 3D Processor Multimedia Coding Optimizations 

This section describes multimedia code optimization 
techniques for the AMD-K6 3D processor. 

For optimal floating-point performance — Wherever possible, 
use the packed single-precision, floating-point capability of 3D 
technology instead of the single-precision, double-precision, 
and extended-precision floating-point capabilities of the x87 
floating-point unit. The 3D units are fully pipelined, allow 
vectorized optimizations, are not stack based, and provide 
faster inverse, square root, and inverse square root calculations. 

Issues to ensure optimal predecode of MMX and 3D 
instructions — Attention must be paid to coding issues that can 
inhibit the predecode, and later dual decode, of x86 
instructions. Instructions are predecoded during instruction 
cache line fills. The predecode information that is produced 
and then stored in the predecode cache is later used by the 
instruction decoders to quickly find consecutive instructions 
and, therefore, enable dual-instruction decode. (The predecode 
information, in particular, reflects the length of instructions.) 

The processor predecode scheme is based on a number of 
assumptions and constraints that have been mentioned 
previously, but which are repeated here for convenience: 

■ Only a subset of x86 instructions are short decodeable and 
require predecode information. These include all MMX and 
3D instructions except for the EMMS, FEMMS, and 
PREFETCH instructions. 

■ Predecodeable instructions can be up to seven bytes in 
length. 

■ The processor predecoders can only examine the first three 
bytes of an instruction to determine the length of the 
instruction and generate the predecode information. To 
determine instruction length, non-modR/M instructions 
require examination of the opcode byte, and modR/M 
instructions require the examination of the opcode byte 
plus the modR/M byte. Instructions with a 0Fh prefix 
require the examination of the 0Fh byte in addition to the 
opcode byte and any modR/M byte. Finally, modR/M 
address modes with a sib byte and no displacement 
(modR/M = 00_xxx_100b) require examination of the 
additional sib byte. Instructions in this last category that 
also require a 0Fh prefix violate the three-byte predecode 
constraint and, therefore, cannot be predecoded. These 
instructions use either [disp32 + (scaled)index] or [base + 
(scaled)index] address modes and, therefore, require the 
examination of four bytes to determine instruction length. 

■ The 32-bit modR/M address mode [ESI] cannot be 
predecoded. 

■ For instructions starting within the last two bytes of a cache 
line, the predecode logic is not able to scan past the end of 
the cache line when it needs to examine more bytes to 
determine the length of an instruction. This constraint 
limits the type of instructions that can be predecoded at the 
end of a cache line. For example, a modR/M instruction that 
starts on the last byte of a 32-byte cache line, or a 0Fh-prefix 
plus modR/M instruction that starts within the last two 
bytes of the cache line, cannot be predecoded. 

■ MMX and 3D instructions have a 0Fh-prefix byte, an opcode 
byte, and a modR/M byte, all of which must be examined by 
the predecode logic. 

These constraints result in the following recommendations for 
successful predecode of multimedia instructions: 

■ With 3D instructions, do not use address modes with large 
(32-bit) displacements. Large displacements result in a total 
instruction length of eight bytes (including the additional 
suffix byte used at the end of the instruction as a 
sub-opcode byte). 

■ With MMX and 3D instructions, do not use the [disp32 + 
(scaled)index], [base + (scaled)index], or [ESI] address 
modes. 

■ Avoid placing the start of MMX and 3D instructions in the 
last two bytes of a cache line. If not successfully 
predecoded, MMX instructions default to vector decodes 
and 3D instructions default to long decodes. 

A comparison of the instruction decode clock-cycle count on 
optimized code is as follows: 

• 0.5 cycle for one short decode as part of a dual decode. 

• 1.0 cycle for a single long decode. 

• 2.0 cycles for a single vector decode (for simple 
instructions such as MMX and 3D instructions). 

Avoid using MMX/3D registers to move double-precision 
floating-point data — Although using an MMX/3D register to 
move x87 floating-point data appears fast, using these registers 
requires the use of the EMMS or FEMMS instruction when 
switching from MMX or 3D instructions to x87 instructions. 

Use the FEMMS instruction instead of the EMMS 
instruction — The AMD-K6 3D processor implements an 
improved version of the EMMS instruction, called FEMMS. 
Because the MMX/3D registers are mapped onto the x87 stack, 
an EMMS or FEMMS instruction must be executed when 
switching from MMX or 3D code to x87 code. Execution of the 
EMMS or FEMMS instruction marks the floating-point tag word 
as empty (all 1's), which guarantees correct x87 results and 
ensures that no x87 exceptions occur in the subsequent code 
due to a stack overflow. 

Each time the processor encounters a switch between MMX or 
3D code and x87 code, in either direction, a significant 
clock-cycle count penalty occurs. The FEMMS instruction was 
created to reduce this penalty. The FEMMS instruction sets the 
floating-point tag word to empty (like EMMS), and also sets all 
of the register values as undefined. If a switch is required 
following a FEMMS instruction, it executes in less than half the 
cycles required after an EMMS instruction. The switch 
overhead occurs when an x87 instruction is encountered, and 
not during the execution of the EMMS and FEMMS 
instructions. In addition, the FEMMS instruction executes in 3 
clock cycles, 2 cycles less than the EMMS instruction. For more 
information on the operation and advantages of the FEMMS 
instruction, see Chapter 4, "3D Technology" on page 81. 

Use the FEMMS instruction at the beginning of an MMX or 3D 
routine — While the FEMMS instruction is not necessary for 
correct program functionality at the beginning of MMX or 3D 
routines, its usage reduces the clock-cycle count penalty when 
entering such routines from preceding x87 code. If no switch 
occurs, the FEMMS takes 3 clock cycles to execute. If a switch is 
necessary, FEMMS reduces the clock cycles required by over 
half. 

Practice the following general rules when using MMX or 3D 
code mixed with x87 code: 

■ Always use the FEMMS instruction (instead of EMMS) at 
the end of an MMX or 3D routine when x87 instructions or 
unknown code follows. 

■ Use the FEMMS instruction at the beginning of an MMX or 
3D routine that is preceded by x87 instructions or unknown 
code. FEMMS serves to reduce any switch penalty. 

■ Group or partition MMX or 3D code separate from x87 code 
to minimize the frequency of switching between MMX or 3D 
operations and x87 operations. 

Use the new 3D PAVGUSB instruction for MPEG-2 
motion compensation — In DVD decoding, motion 
compensation performs a lot of byte averaging between and 
within macroblocks. The PAVGUSB instruction helps speed up 
these operations. In addition, PAVGUSB can free up some 
registers and make unrolling the averaging loops possible. 
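
To make the byte-averaging operation concrete, the following C sketch computes a rounded average of eight packed unsigned bytes in two ways. The function names are hypothetical. The first function is a plain per-byte reference for (a + b + 1) >> 1. The second works on the whole 64-bit word at once with the same mask, shift, and add steps that an MMX-only routine needs when no packed-average instruction is available.

```c
#include <stdint.h>

/* Per-byte reference: rounded average (a + b + 1) >> 1 of each of the
 * eight unsigned bytes packed into a and b. */
uint64_t avg_bytes_reference(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)(a >> (8 * i)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * i)) & 0xFFu;
        r |= (uint64_t)((x + y + 1u) >> 1) << (8 * i);
    }
    return r;
}

/* Whole-word version. Clearing bit 0 of every byte before the shift
 * keeps bits from leaking into the neighbouring byte, so each byte
 * becomes floor(a/2) and floor(b/2). The rounding adjustment is 1
 * whenever either input byte was odd, that is (a | b) & 1 per byte.
 * The per-byte sums never exceed 255, so no carry crosses a byte. */
uint64_t avg_bytes_swar(uint64_t a, uint64_t b)
{
    const uint64_t MASK_FE = 0xFEFEFEFEFEFEFEFEull;
    const uint64_t MASK_01 = 0x0101010101010101ull;

    uint64_t adjust = (a | b) & MASK_01;
    uint64_t half_a = (a & MASK_FE) >> 1;
    uint64_t half_b = (b & MASK_FE) >> 1;
    return half_a + half_b + adjust;
}
```

The whole-word version needs two masks, two shifts, an OR, and two adds per 8 bytes, which is the work a single packed-average instruction replaces.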

The following code fragment uses original MMX code to 
perform averaging between the source macroblock and 
destination macroblock: 

    mov   esi, DWORD PTR Src_MB 
    mov   edi, DWORD PTR Dst_MB 
    mov   edx, DWORD PTR SrcStride 
    mov   ebx, DWORD PTR DstStride 
    movq  mm7, QWORD PTR [ConstFEFE] 
    movq  mm6, QWORD PTR [Const0101] 
    mov   ecx, 16 
L1: 
    movq  mm0, [esi]        ;mm0=qword1 
    movq  mm1, [edi]        ;mm1=qword3 
    movq  mm2, mm0 
    movq  mm3, mm1 
    pand  mm2, mm6 
    pand  mm3, mm6 
    pand  mm0, mm7          ;mm0 = qword1 & 0xfefefefe 
    pand  mm1, mm7          ;mm1 = qword3 & 0xfefefefe 
    por   mm2, mm3          ;calculate adjustment 
    psrlq mm0, 1            ;mm0 = (qword1 & 0xfefefefe)/2 
    psrlq mm1, 1            ;mm1 = (qword3 & 0xfefefefe)/2 
    pand  mm2, mm6 
    paddb mm0, mm1          ;mm0 = qw1/2 + qw3/2 w/o adjustment 
    paddb mm0, mm2          ;add lsb adjustment 
    movq  [edi], mm0 
    movq  mm4, [esi+8]      ;mm4=qword2 
    movq  mm5, [edi+8]      ;mm5=qword4 
    movq  mm2, mm4 
    movq  mm3, mm5 
    pand  mm2, mm6 
    pand  mm3, mm6 
    pand  mm4, mm7          ;mm4 = qword2 & 0xfefefefe 
    pand  mm5, mm7          ;mm5 = qword4 & 0xfefefefe 
    por   mm2, mm3          ;calculate adjustment 
    psrlq mm4, 1            ;mm4 = (qword2 & 0xfefefefe)/2 
    psrlq mm5, 1            ;mm5 = (qword4 & 0xfefefefe)/2 
    pand  mm2, mm6 
    paddb mm4, mm5          ;mm4 = qw2/2 + qw4/2 w/o adjustment 
    paddb mm4, mm2          ;add lsb adjustment 
    movq  [edi+8], mm4 
    add   esi, edx 
    add   edi, ebx 
    loop  L1 

The following code fragment uses the 3D PAVGUSB instruction 
to perform averaging between the source macroblock and 
destination macroblock: 

    mov     eax, DWORD PTR Src_MB 
    mov     edi, DWORD PTR Dst_MB 
    mov     edx, DWORD PTR SrcStride 
    mov     ebx, DWORD PTR DstStride 
    mov     ecx, 16 
L1: 
    movq    mm0, [eax]      ;mm0=qword1 
    movq    mm1, [eax+8]    ;mm1=qword2 
    pavgusb mm0, [edi]      ;(qw1+qw3)/2 with adjustment 
    pavgusb mm1, [edi+8]    ;(qw2+qw4)/2 with adjustment 
    add     eax, edx 
    movq    [edi], mm0 
    movq    [edi+8], mm1 
    add     edi, ebx 
    loop    L1 

3D Matrix Multiplication Optimization Example 



The code samples starting on page 488 contain both a 
non-optimized and an optimized sample of a 4x4 matrix 
multiplied by a 4x1 vector. This type of code is often used in 3D 
graphics for geometry transformation. This routine serves to 
translate, scale, rotate, and apply perspective to 3D coordinates 
represented in homogeneous coordinates. The code samples 
contain many addition and multiplication instructions that can 
now be implemented in any one of three ways. For high-end, 3D 
graphic programs, x87 FPU instructions supply only moderate 
performance, are not superscalar, and cannot be efficiently 
intermixed with MMX and 3D instructions. Integer instructions 
and MMX instructions, while fast and superscalar, do not have 
the accuracy and dynamic range that is required for these 
programs. Therefore, the 3D instructions, providing the benefit 
of packed, floating-point data precision and parallel execution, 
can be used in order to write software that outperforms 
standard floating-point code and has no switching overhead 
when intermixed with MMX code. The following two code 
samples illustrate non-optimized and optimized code. A 
description of the steps a programmer should take when 
optimizing code for the AMD-K6 3D processor starts on 
page 493. 
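
Before reading the assembly listings, it may help to see the transform they implement as a plain C reference. This is a sketch under stated assumptions, not the book's code: `Vec4`, `Matrix4`, and `transform4x4_ref` are hypothetical names, and the vertex here holds only the four coordinates, whereas the listings assume a 128-byte vertex structure. The row-by-column formula is taken from the comments at the top of the code samples.

```c
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float m[4][4]; } Matrix4;

/* Transform cnt vertices in place:
 *   new_x = x*m[0][0] + y*m[0][1] + z*m[0][2] + w*m[0][3]
 * and likewise for y, z, and w with rows 1, 2, and 3. Each vertex
 * costs 16 multiplies and 12 adds, which is why the packed
 * single-precision instructions are attractive here. */
void transform4x4_ref(Vec4 *v, size_t cnt, const Matrix4 *mat)
{
    for (size_t i = 0; i < cnt; i++) {
        const float x = v[i].x, y = v[i].y, z = v[i].z, w = v[i].w;
        float out[4];

        for (int r = 0; r < 4; r++)
            out[r] = x * mat->m[r][0] + y * mat->m[r][1]
                   + z * mat->m[r][2] + w * mat->m[r][3];

        v[i].x = out[0];
        v[i].y = out[1];
        v[i].z = out[2];
        v[i].w = out[3];
    }
}
```

With a matrix whose last column is (5, 6, 7, 1) and an otherwise identity layout, a vertex (1, 2, 3, 1) is translated to (6, 8, 10, 1).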

Non-Optimized Code Sample: 

; void Transform4x4(Vertex *firstv, int cnt, const Matrix *m) 
; NON OPTIMIZED VERSION 
; Full 4x4 matrix transform of an array of cnt vertices starting from the 
; vertex pointed to by firstv, using the transform matrix pointed to by m. 
; Each vertex data structure is assumed to occupy 128 bytes, 16 bytes of 
; which contain the vertex coordinates to be transformed. 
; new_x = x*m[0][0] + y*m[0][1] + z*m[0][2] + w*m[0][3]; 
; new_y = x*m[1][0] + y*m[1][1] + z*m[1][2] + w*m[1][3]; 
; new_z = x*m[2][0] + y*m[2][1] + z*m[2][2] + w*m[2][3]; 
; new_w = x*m[3][0] + y*m[3][1] + z*m[3][2] + w*m[3][3]; 

Vrtx_X  equ  0h 
Vrtx_Y  equ  4h 
Vrtx_Z  equ  8h 
Vrtx_W  equ  0ch 
Mat_00  equ  0h 
Mat_01  equ  4h 
Mat_02  equ  8h 
Mat_03  equ  0ch 
Mat_10  equ  10h 
Mat_11  equ  14h 
Mat_12  equ  18h 
Mat_13  equ  1ch 
Mat_20  equ  20h 
Mat_21  equ  24h 
Mat_22  equ  28h 
Mat_23  equ  2ch 
Mat_30  equ  30h 
Mat_31  equ  34h 
Mat_32  equ  38h 
Mat_33  equ  3ch 

; EAX = m ptr 
; EBX = firstv ptr 
; EDX = lastv ptr 

Comments appear after the code lines. 

TransformLoop: 

;All multiplies for XResult: 
movq mm0, QWORD PTR [ebx + Vrtx_X] ;mm0 = y | x 
movq mm2, mm0 ;copy vector 

Right in the beginning there is a dependency for mm0, which stalls the second movq 2 clock cycles even though 
both instructions are short-decodeable and decode together as an instruction pair. 

pfmul mm0, QWORD PTR [eax + Mat_00] ;mm0 = y*a21 | x*a11 
The PFMUL instruction leads to another dependency, but because of the previous stall, the PFMUL instruction 
executes when Mat_00 loads from memory. The PFMUL instruction translates to a 3D ALU and a Load unit 
operation. 

movq mm1, QWORD PTR [ebx + Vrtx_Z] ;mm1 = w | z 
The MOVQ instruction decodes with the previous PFMUL, but there is now a resource constraint with both 
instructions trying to use the Load unit. This contention causes one of the instructions to stall an extra cycle. 

movq mm3, mm1 ;copy vector 

Another stall while waiting for mm1. 

pfmul mm1, QWORD PTR [eax + Mat_20] ;mm1 = w*a41 | z*a31 
Same as the previous PFMUL instruction. Note that tasks in this code line are serialized, with no opportunity for 
overlap of execution resources. Even if the instructions short decode in pairs, other constraints are causing stalls. In 
addition, a scheduler stall occurs when an instruction cannot retire off the bottom of the scheduler because 
dependency and resource stalls have delayed the instruction too many cycles. 

;All multiplies for YResult: 
movq mm4, mm2 ;copy vector 

pfmul mm2, QWORD PTR [eax + Mat_01] ;mm2 = y*a22 | x*a12 

These instructions are paired. The PFMUL instructions decode to a Load unit operation followed by a 3D Multiply 
unit operation. 

movq mm5, mm3 ;copy vector 

pfmul mm3, QWORD PTR [eax + Mat_21] ;mm3 = w*a42 | z*a32 
These instructions are paired. Same comments as before. 

;All multiplies for ZResult: 
movq mm6, mm4 ;copy vector 

pfmul mm4, QWORD PTR [eax + Mat_02] ;mm4 = y*a23 | x*a13 
These instructions are paired. Same comments as before. 

movq mm7, mm5 ;copy vector 

pfmul mm5, QWORD PTR [eax + Mat_22] ;mm5 = w*a43 | z*a33 
These instructions are paired. Same comments as before. 

;All multiplies for WResult: 
pfmul mm6, QWORD PTR [eax + Mat_03] ;mm6 = y*a24 | x*a14 
pfmul mm7, QWORD PTR [eax + Mat_23] ;mm7 = w*a44 | z*a34 

These instructions are paired. However, this pair causes a conflict for both the Load unit and the 3D Multiplier 
resources, which stalls one instruction in the scheduler for a clock cycle. The instructions execute in a staggered 
fashion. The goal for short-decodeable pairs is simultaneous execution. 

;All first sums:
; of XResult

pfadd mm0, mm1 ;mm0 = w*a41 + y*a21 | z*a31 + x*a11

; of YResult

pfadd mm2, mm3 ;mm2 = w*a42 + y*a22 | z*a32 + x*a12

These instructions are paired. However, this pair causes a conflict for the 3D ALU, which delays one instruction.

; of ZResult

pfadd mm4, mm5 ;mm4 = w*a43 + y*a23 | z*a33 + x*a13

; of WResult

pfadd mm6, mm7 ;mm6 = w*a44 + y*a24 | z*a34 + x*a14

These instructions are paired, but there is a conflict for the 3D ALU and with one of the PFADD instructions from the
previous pair that was delayed one cycle. These dual-decodeable operations serialize execution, eventually stalling
the scheduler because RISC86 instructions can no longer retire.



;All final sums:
pfacc mm0, mm0 ; of XResult
pfacc mm2, mm2 ; of YResult

These instructions are paired, but there is a conflict for the 3D ALU. See the comments above.

pfacc mm4, mm4 ; of ZResult

pfacc mm6, mm6 ; of WResult

These instructions are paired, but there is a conflict for the 3D ALU. See the comments above.

;All result stores:
movd DWORD PTR [ebx + Vrtx_X], mm0 ; of XResult
movd DWORD PTR [ebx + Vrtx_Y], mm2 ; of YResult
These instructions are paired, but there is a conflict for the Store unit.

movd DWORD PTR [ebx + Vrtx_Z], mm4 ; of ZResult
movd DWORD PTR [ebx + Vrtx_W], mm6 ; of WResult

These instructions are paired, but there is a conflict for the Store unit as well as the delayed store operation from
the previous instruction pair.

add ebx, Vertex_Stride ;Advance to next vertex

cmp ebx, edx ;Compare with ptr to last vertex

These instructions are paired, but a dependency on the ebx value delays the second instruction by one cycle.

jbe TransformLoop ;If not done yet



Optimized Code Sample:

; void Transform4x4(Vertex *firstv, int cnt, const Matrix *m)
; OPTIMIZED VERSION

; Full 4x4 matrix transform of an array of cnt vertices starting from the
; vertex pointed to by firstv, using the transform matrix pointed to by m.

; Each vertex data structure is assumed to occupy 128 bytes, 16 bytes of
; which contain the vertex coordinates to be transformed.

; new_x = x*m[0][0] + y*m[0][1] + z*m[0][2] + w*m[0][3];

; new_y = x*m[1][0] + y*m[1][1] + z*m[1][2] + w*m[1][3];

; new_z = x*m[2][0] + y*m[2][1] + z*m[2][2] + w*m[2][3];

; new_w = x*m[3][0] + y*m[3][1] + z*m[3][2] + w*m[3][3];



Vrtx_X equ 0h
Vrtx_Y equ 4h
Vrtx_Z equ 8h
Vrtx_W equ 0ch

Mat_00 equ 0h
Mat_01 equ 4h
Mat_02 equ 8h
Mat_03 equ 0ch
Mat_10 equ 10h
Mat_11 equ 14h
Mat_12 equ 18h
Mat_13 equ 1ch
Mat_20 equ 20h
Mat_21 equ 24h
Mat_22 equ 28h
Mat_23 equ 2ch
Mat_30 equ 30h
Mat_31 equ 34h
Mat_32 equ 38h
Mat_33 equ 3ch



;EAX = m       ptr to transform matrix
;EBX = firstv  ptr to first vertex to be transformed
;ECX = cnt     count of vertices to be transformed
The code begins here, but this section is not in the loop. The initial Loads conflict and stall waiting to load the first
vertex values and the first four values from the matrix. However, once the loop begins, this code runs efficiently.
Note that most of these x86 instructions are four bytes long, which helps to make them short decodeable.

;Load first vertex:
movq mm6, QWORD PTR [ebx] ;mm6 = y | x

movq mm7, QWORD PTR [ebx + Vrtx_Z] ;mm7 = w | z

These instructions decode together, but cause a conflict for the Load unit.

;Start load of matrix:
movq mm0, QWORD PTR [eax + Mat_00] ;mm0 = m01 | m00
movq mm1, QWORD PTR [eax + Mat_20] ;mm1 = m03 | m02
Decode together, but conflict for the Load unit.
TransformLoop:

prefetchw [ebx + 128] ;Prefetch next vertex

The PREFETCHW instruction is a vector decode and takes 2 cycles. However, this instruction increases efficiency
because it begins the preload of the L1 data cache with the next vertex. A vertex is four dwords or 1/2 a cache line.
However, the 'stride' or distance from one vertex data structure to the next within the vertex array, in this example,
is 128 bytes, which means that each vertex is in a separate cache line. It is assumed that vertex data starts on cache
line boundaries. From this point forward, the x86 instructions form instruction pairs that both decode into one
Opquad. An Opquad is one line in the instruction scheduler that is composed of four RISC86 operations.

movq mm2, QWORD PTR [eax + Mat_01] ;mm2 = m11 | m10

This MOVQ instruction continues to fill in the matrix. Separating the matrix load from the multiply instruction avoids
serializing the load and multiply, which can lead to a stall of the scheduler. The load takes 2-3 cycles to execute
and the multiply takes 2 cycles to execute. Including the operand fetch stage almost fills the six-stage length of the
scheduler.

pfmul mm0, mm6 ;mm0 = y*m01 | x*m00

This PFMUL instruction is paired with the MOVQ instruction. These two instructions use different resources (Load
unit and 3D ALU, respectively). There are no resource conflicts, no dependencies (mm0 should be loaded from
three cycles earlier), and the instructions execute together.

movq mm3, QWORD PTR [eax + Mat_21] ;mm3 = m13 | m12

pfmul mm1, mm7 ;mm1 = w*m03 | z*m02

Same comments as previous instruction pair.

movq mm4, QWORD PTR [eax + Mat_02] ;mm4 = m21 | m20

pfmul mm2, mm6 ;mm2 = y*m11 | x*m10

Same comments as previous instruction pair, except the load of mm2 was started two cycles earlier and should be
forwarded from the Load unit to the 3D ALU just-in-time.

movq mm5, QWORD PTR [eax + Mat_22] ;mm5 = m23 | m22

pfmul mm3, mm7 ;mm3 = w*m13 | z*m12

In this pair of instructions, the last free register is loaded for now (mm5). Because there are only eight MMX
registers, the registers must be reused and then reloaded with the matrix values for the next vertex calculation.

;First sum of XResult:
pfadd mm0, mm1 ;mm0 = w*m03 + y*m01 | z*m02 + x*m00

pfmul mm4, mm6 ;mm4 = y*m21 | x*m20

These two 3D instructions can be paired because the 3D ALU and multiplier are separate units and both have
access to the issue buses for the register X and register Y execution pipelines. Note that at this time the processor is
operating on eight single-precision, floating-point values (packed into four MMX registers) and the processor
produces four single-precision values (in two MMX registers).

;First sum of YResult:
pfadd mm2, mm3 ;mm2 = w*m13 + y*m11 | z*m12 + x*m10

The mm3 operand is forwarded from the 3D multiplier output.

movq mm1, QWORD PTR [eax + Mat_03] ;mm1 = m31 | m30

The previous two instructions are paired. The MOVQ instruction moves in the first pair of the remaining four matrix
values.

pfmul mm5, mm7 ;mm5 = w*m23 | z*m22

The mm5 operand is forwarded from the Load unit and the mm7 operand is forwarded from the 3D multiplier.

movq mm3, QWORD PTR [eax + Mat_23] ;mm3 = m33 | m32

The previous two instructions are paired. The MOVQ instruction moves in the last pair of matrix values.

add ebx, Vertex_Stride ;Advance to next vertex

The pointer to the next vertex is updated. In this example, Vertex_Stride = 128.

pfmul mm1, mm6 ;mm1 = y*m31 | x*m30

The previous two instructions are paired.

;Final sum of XResult and YResult:
pfacc mm0, mm2 ;mm0 = YRes | XRes

The first pair of vertex values are complete and can be stored two clock cycles later (the 3D accumulate has a
two-cycle execution latency, as do all 3D ALU and Multiply instructions).

pfmul mm3, mm7 ;mm3 = w*m33 | z*m32

The previous two instructions are paired and use the 3D ALU and Multiplier units simultaneously.

;First sum of ZResult
pfadd mm4, mm5 ;mm4 = w*m23 + y*m21 | z*m22 + x*m20

Continuing the goal of spreading out dependencies, this instruction is two cycles after the mm5 calculation.

;Load next vertex
movq mm6, QWORD PTR [ebx + Vrtx_X] ;mm6 = y | x

The previous two instructions are paired. This MOVQ instruction begins to load the next vertex, which the
PREFETCH instruction has been preloading into the L1 data cache.

;First sum of WResult:
pfadd mm1, mm3 ;mm1 = w*m33 + y*m31 | z*m32 + x*m30

;Load next vertex
movq mm7, QWORD PTR [ebx + Vrtx_Z] ;mm7 = w | z
The previous two instructions are paired. The second part of the new vertex is loaded.

movq QWORD PTR [ebx - 128 + Vrtx_X], mm0 ;Store XResult and YResult

;Start next iteration

movq mm0, QWORD PTR [eax + Mat_00] ;mm0 = m01 | m00
The previous two instructions are paired and can complete simultaneously because the AMD-K6 3D processor has
separate Load and Store units. Unfortunately, all the matrix values must be reloaded with each iteration because
there are not enough registers to hold the vertices, the full matrix, and intermediate values.

;Final sum of ZResult and WResult:
pfacc mm4, mm1 ;mm4 = WRes | ZRes

;Start next iteration
movq mm1, QWORD PTR [eax + Mat_20] ;mm1 = m03 | m02
The previous two instructions are paired.

movq QWORD PTR [ebx - 128 + Vrtx_Z], mm4 ;Store ZResult and WResult

Fortunately, the Store unit can accept data up to two cycles later without a penalty because there are no
calculations left to hide the execution latency of the last accumulate instruction. Therefore, this store is not delayed.

loop TransformLoop ;If not done yet, go to beginning of the loop.

The previous two instructions are short decodeable and paired. Note that on the processor, the LOOP instruction
executes in the same amount of time as the CMP and JBE instructions in the non-optimized example. However, the
LOOP instruction, being only one instruction instead of two, is more efficient.
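
For checking the output of either assembly version, a plain scalar reference can be useful. The following is a minimal C sketch of the four formulas given in the listing header (new_x through new_w). The type layouts and the name transform4x4_ref are assumptions made for illustration; only the 128-byte vertex size and the formulas come from the text, and the sketch makes no claim about the memory order of matrix elements used by the assembly listing.

```c
#include <stddef.h>

/* Hypothetical layout matching the example's assumptions: each vertex
   record occupies 128 bytes, of which the first 16 hold x, y, z, w. */
typedef struct {
    float x, y, z, w;
    unsigned char pad[128 - 4 * sizeof(float)];
} Vertex;

/* Hypothetical matrix type indexed as m[row][column]. */
typedef struct {
    float m[4][4];
} Matrix;

/* Scalar reference: transforms cnt vertices in place using the
   new_x .. new_w formulas from the listing header. */
void transform4x4_ref(Vertex *firstv, int cnt, const Matrix *mat)
{
    for (int i = 0; i < cnt; i++) {
        Vertex *v = &firstv[i];
        const float x = v->x, y = v->y, z = v->z, w = v->w;
        float r[4];
        for (int row = 0; row < 4; row++) {
            r[row] = x * mat->m[row][0] + y * mat->m[row][1]
                   + z * mat->m[row][2] + w * mat->m[row][3];
        }
        v->x = r[0];
        v->y = r[1];
        v->z = r[2];
        v->w = r[3];
    }
}
```

Because it works one scalar at a time, this version says nothing about pairing or latency; it is only a correctness baseline.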

Programming Steps

The following descriptions review and expand on the steps
taken to arrive at the optimized code example: 

Schedule code into pairs of short-decodeable x86 instructions 
that correspond to the expected decode pairing — Each 
short-decodeable pair of instructions decodes into four RISC86 
operations that form a set of four Op entries in the instruction 
scheduler. This set of entries moves down the scheduler and 
eventually retires from the bottom of the scheduler buffer. The 
scheduler buffer can hold a total of six sets of entries (which 
represents a total of 24 Op entries). Under ideal conditions of 
uninterrupted decode and execution (no stalls), these entries 
also correspond to clock cycles through the scheduler and
execution pipes of the AMD-K6 3D processor. Consequently, the
programmer should schedule dependent instructions apart
from each other, in different decode pairs, based on the
execution latencies of the corresponding RISC86 operations. It
is cleanest and simplest to use only MOVQ and MOVD
instructions for memory loads and stores, and to use
register-to-register instructions for computations. In addition,
this technique minimizes or avoids instruction scheduling
delays caused by long-latency instructions (such as those with a
memory load followed by a two-cycle register operation) that
do not complete in time and are therefore not ready to commit
results when the entry containing the associated RISC86
operations reaches the bottom of the scheduler buffer. This
situation can lead to a stall in which no new RISC86 operations
can be placed in the scheduler until an entry is available.

Interleave independent sequences of instructions (subject to 
register allocation constraints) to fill each and every decode 
slot — To the extent that this is achieved while maintaining the 
proper minimum distances between dependent operations and 
respecting execution resource constraints, optimal decode 
pairing and instruction execution without delays or stalls is 
very likely to be achieved. 

Use separate moves from memory and register-to-register 
multiplies, instead of register-to-register copies and multiplies 
from memory — This technique allows easy explicit and optimal 
scheduling of memory loads and dependent register operations, 
spaced at least two decode pairs apart and corresponding to the 
two-cycle load execution latency. While this technique 
generally applies to all MMX and 3D instructions, particularly 
avoid the use of the memory form of instructions with two-cycle 
execution latencies (for example, all 3D instructions). In other 
words, optimal performance is best and most easily achieved 
using a RISC coding style (despite the extra MOVD/MOVQ 
instructions). 

Schedule instructions apart that use the same execution 
resources — For example, multiplies should be spread apart. 
The programmer should put at least one decode slot between 
multiplies. Similarly, adds and accumulates, memory loads, and 
memory stores should be spread apart.

Use the pipelining ability of accumulate instructions to
perform two independent accumulates and to pair the resultant 
values together as a 64-bit result — This technique allows the 
use of fewer MOVQ instructions instead of a greater number of 
MOVD instructions. Overall, four accumulate and four MOVD 
instructions get replaced with two and two. In some situations, 
where scalar register results are naturally produced and are 
then stored out to memory via a series of MOVDs, it may be 
preferable to reduce the number of store operations through 
use of PUNPCKLDQ instructions followed by MOVQ 
instructions. Often this optimization may not be worthwhile or 
favorable, particularly given the extra latency introduced by 
the PUNPCKLDQ operations and possibly by memory 
alignment issues for the MOVQ instructions. Typically, it is best 
to spend the overhead to pack initial scalar operands together 
when first read from memory (using MOVD instructions), 
followed by vector computations and MOVQs back to memory. 
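
To make the pairing of two accumulates into one 64-bit result concrete, the following C sketch models what a single PFACC produces from two packed operands. The Packed2f type and the function name pfacc_model are hypothetical; the sketch only mirrors the arithmetic and says nothing about timing.

```c
/* Two packed single-precision values, as held in one 64-bit MMX register. */
typedef struct {
    float lo;   /* bits 31..0  */
    float hi;   /* bits 63..32 */
} Packed2f;

/* Models PFACC dst, src: the two halves of each operand are summed.
   The dst sum lands in the low half and the src sum in the high half. */
Packed2f pfacc_model(Packed2f dst, Packed2f src)
{
    Packed2f r;
    r.lo = dst.lo + dst.hi;
    r.hi = src.lo + src.hi;
    return r;
}
```

In the optimized sample, this is how one accumulate of the X and Y partial sums yields YRes | XRes, ready for a single MOVQ store.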

Separate the first and second stores by at least two or three 
decode slots (in other words, by one intervening decode pair) 
within a series of two or more stores to a cache line recently 
brought in and not yet written to — This technique is in contrast 
to the second and following stores, which can be in adjacent 
decode pairs. This technique allows an extra cycle for the initial 
MESI-state change to the cache line (from Exclusive to Modified).

Schedule the ADD/CMP/JCC instructions apart (or at least the 
ADD and CMP instructions) — This scheduling is primarily 
desirable when the ADD and/or CMP instructions reference a 
memory operand and are, therefore, subject to the latency of 
the load operation. In such cases, either the ADD/CMP 
instruction should be scheduled apart from (and ahead of) the 
JCC instructions, or a separate MOV instruction, scheduled 
earlier, should be used to fetch the memory operand. An 
alternative and desirable solution in some cases is to replace 
these instructions with the LOOP instruction (along with 
corresponding setup and usage of the ECX register before and 
within the loop). 

Take advantage of the PREFETCH instruction— In the 
optimized code example, each vertex occupies a different cache 
line (the stride between vertices being 32 bytes or greater). 
Consequently, one cache miss and associated 32 byte line fill 
occurs per loop iteration. To maximize overlap of the cache fill
from L2 cache or main memory, use the PREFETCH instruction
to start the fill before starting to process the current vertex
(which is already in the cache, having been prefetched at the
beginning of processing of the last vertex). Specify the address
of the elements of the next vertex that will be accessed first. In
addition, schedule the loads of the next vertex's data elements
away from the prefetch instruction. Doing so ensures that the
load data (which will be the first data of the cache line to be
fetched) has been received and is available for forwarding to
these loads while the rest of the fill proceeds to completion.
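
The same idea can be sketched in C. The example below assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch builtin is used here as a stand-in for the PREFETCH instruction; the function name and the choice of summing the w coordinate are purely illustrative. The 128-byte stride follows the optimized code example.

```c
#include <stddef.h>

#define VERTEX_STRIDE_FLOATS 32  /* 128-byte records, 4-byte floats (assumed) */

/* Sums the w coordinate (fourth float) of cnt vertex records, issuing a
   prefetch for the next record before working on the current one.
   __builtin_prefetch is only a hint and does not fault. */
float sum_w_with_prefetch(const float *verts, size_t cnt)
{
    float total = 0.0f;
    for (size_t i = 0; i < cnt; i++) {
        if (i + 1 < cnt) {
            /* Start the fill of the next vertex's cache line early. */
            __builtin_prefetch(&verts[(i + 1) * VERTEX_STRIDE_FLOATS], 0, 3);
        }
        total += verts[i * VERTEX_STRIDE_FLOATS + 3];
    }
    return total;
}
```

Whether the hint helps depends on how much work separates the prefetch from the first load of the next record, which is the scheduling point made above.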

Move the first few MOVQs around to the bottom of the loop- 
Typically, the first instructions after the prefetch instruction 
would be a series of MOVQs to get the first vertex and matrix 
elements to operate on, without any other available 
independent operations to fill out these first couple of decode 
pairs. Similarly, near the bottom of the loop, as the last 
computations are performed, there would also be some 
partially-filled decode slots. To fix both of these problems, 
move the first few vertex and matrix element MOVQ 
instructions from the bottom of the loop into the empty slots (as 
well as duplicating these MOVQs in the setup code before the 
start of the loop). 
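
This loop rotation can be illustrated in C on a simpler computation. The sketch below uses a dot product, which is an assumption chosen for brevity and is not the vertex transform itself; both function names are hypothetical. Unlike the assembly sample, the rotated C version does not load past the last element.

```c
#include <stddef.h>

/* Straightforward form: each iteration loads, then computes. */
float dot_simple(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Rotated form: the loads for iteration i+1 are issued at the bottom of
   iteration i and duplicated once in the setup code before the loop. */
float dot_rotated(const float *a, const float *b, size_t n)
{
    if (n == 0) {
        return 0.0f;
    }
    float x = a[0], y = b[0];          /* setup: duplicated first loads */
    float acc = 0.0f;
    for (size_t i = 0; i + 1 < n; i++) {
        float prod = x * y;            /* compute on values loaded earlier */
        x = a[i + 1];                  /* loads moved to the loop bottom */
        y = b[i + 1];
        acc += prod;
    }
    return acc + x * y;                /* last element needs no further load */
}
```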

Pay attention to the alignment of instructions relative to 
32-byte cache line boundaries— The code samples do not show 
the actual memory alignment of instructions and, therefore, 
whether the decode of any instructions may be impacted by 
end-of-cache-line degraded predecode. These code examples 
require a suitable starting alignment (relative to a 32-byte 
address boundary). There also exists the possibility that there is 
no starting alignment for which all instructions can be 
successfully predecoded. In such cases, adjustments to the code 
(such as padding with one-byte or multiple-byte NOPs, 
instruction rearrangement, or different instruction selections) 
may be warranted. In the case of 3D instructions, which can still 
be hardware decoded as a single long decode, the best 
alternative may sometimes be to do nothing. 

Avoid certain address modes with MMX and 3D instructions 
that inhibit instruction predecode — As discussed earlier in the 
section, the [ESI] modR/M address mode (without any 
displacement bytes or index register) inhibits successful 
instruction predecode and should be avoided. In addition, for
MMX and 3D instructions, address modes that use a SIB byte
with mod=00b in the modR/M byte should be avoided. These
cases consist of [base + (scaled)index] and [disp32 +
(scaled)index] address modes.
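
A small C sketch can show how these two cases are recognized from the modR/M byte alone, assuming 32-bit addressing. The helper names are hypothetical. For example, the byte 46h used in the db 0d9h, 046h, 00h sequence of the floating-point sample decodes as mod=01b, r/m=110b, that is [ESI+disp8], whereas 06h would be the bare [ESI] form.

```c
#include <stdint.h>
#include <stdbool.h>

/* Fields of an x86 modR/M byte: mod (bits 7..6) and r/m (bits 2..0). */
static unsigned modrm_mod(uint8_t b) { return (b >> 6) & 0x3u; }
static unsigned modrm_rm(uint8_t b)  { return b & 0x7u; }

/* True for the bare [ESI] form: mod = 00b, r/m = 110b, no displacement. */
bool is_bare_esi(uint8_t modrm)
{
    return modrm_mod(modrm) == 0 && modrm_rm(modrm) == 6;
}

/* True when a SIB byte follows (r/m = 100b) and mod = 00b, which covers
   the [base + (scaled)index] and [disp32 + (scaled)index] forms. */
bool is_sib_with_mod00(uint8_t modrm)
{
    return modrm_mod(modrm) == 0 && modrm_rm(modrm) == 4;
}
```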

Division

The 3D instructions can be used to compute a very fast, highly 
accurate reciprocal or quotient. 

Consider the quotient q = a/b. An (on-chip) ROM-based table
lookup can be used to quickly produce a 14-to-15-bit precision
approximation of 1/b (using just one 2-cycle latency PFRCP
instruction). A full 24-bit precision reciprocal can then be
quickly computed from this approximation using a
Newton-Raphson algorithm.

The general Newton-Raphson recurrence for the reciprocal is as 
follows: 

$X_{i+1} = X_i \cdot (2 - b \cdot X_i)$

Given that the initial approximation X_0 is accurate to at least 14
bits, and that a full IEEE single-precision mantissa contains 24
bits, just one Newton-Raphson iteration is required. The
following sequence shows the 3D instructions that produce the
initial reciprocal approximation, compute the full precision
reciprocal from the approximation, and finally, complete the
desired divide of a/b.

X_0 = PFRCP(b)

X_1 = PFRCPIT1(b, X_0)

X_2 = PFRCPIT2(X_1, X_0)

q = PFMUL(a, X_2)

The 24-bit final reciprocal value is X_2. In the AMD-K6 3D
processor 3D implementation, the estimate contains the correct 
round-to-nearest value for approximately 99% of all arguments. 
The remaining arguments differ from the correct 
round-to-nearest value for the reciprocal by 1 
unit-in-the-last-place (ulp). The quotient is formed in the last 
step by multiplying the reciprocal by the dividend a.

Optimized 15-Bit Precision Divide

This divide operation executes with a total latency of 4 cycles,
assuming that the programmer is able to hide the latency of the
first MOVD/MOVQ instructions within preceding code.

MOVD      MM0, [mem]     ; 0 | w
PFRCP     MM0, MM0       ; 1/w | 1/w
MOVQ      MM2, [mem]     ; y | x
PFMUL     MM2, MM0       ; y/w | x/w

Optimized Full 24-Bit Precision Divide

This divide operation executes with a total latency of 8 cycles,
assuming that the programmer is able to hide the latency of the
first MOVD/MOVQ instructions within preceding code.

MOVD      MM0, [mem]     ; 0 | w
PFRCP     MM1, MM0       ; 1/w | 1/w
PFRCPIT1  MM0, MM1
MOVQ      MM2, [mem]     ; y | x
PFRCPIT2  MM0, MM1       ; 1/w | 1/w
PFMUL     MM2, MM0       ; y/w | x/w

Pipelined Pair of 24-Bit Precision Divides

This divide operation executes with a total latency of 8 cycles,
assuming that the programmer is able to hide the latency of the
first MOVD/MOVQ instructions within preceding code.

MOVD      MM1, [mem]     ; 0 | w0
MOVD      MM2, [mem+4]   ; 0 | w1
PFRCP     MM1, MM1       ; 1/w0 | 1/w0
MOVQ      MM0, [mem]
PFRCP     MM2, MM2       ; 1/w1 | 1/w1
PUNPCKLDQ MM1, MM2       ; 1/w1 | 1/w0
PFRCPIT1  MM0, MM1
MOVQ      MM2, [mem]     ; y | x
PFRCPIT2  MM0, MM1       ; 1/w1 | 1/w0
PFMUL     MM2, MM0       ; y/w1 | x/w0
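
The effect of the single Newton-Raphson step described in this section can be tried out in portable C. The sketch below is not an emulation of PFRCP, PFRCPIT1, or PFRCPIT2: the function rough_recip is a hypothetical stand-in that imitates a roughly 14-bit table estimate by clearing the low mantissa bits of an exact reciprocal, and the refinement is the textbook recurrence X1 = X0 * (2 - b * X0) evaluated in ordinary float arithmetic.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the table lookup: the exact reciprocal with its low
   mantissa bits cleared, leaving roughly 14 significant bits.
   This imitates the precision of the estimate, not the actual table. */
float rough_recip(float b)
{
    float r = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFFC00u;             /* drop the low 10 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - b * x0). */
float refine_recip(float b, float x0)
{
    return x0 * (2.0f - b * x0);
}

/* q = a / b via rough estimate, one refinement, and a final multiply. */
float divide_nr(float a, float b)
{
    return a * refine_recip(b, rough_recip(b));
}
```

Because the error of the estimate is roughly squared by each step, a 14-bit seed is enough to reach single-precision accuracy after one iteration, which is the argument made in the text.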



Square Root and Reciprocal Square Root 

The 3D instructions can also be used to compute a reciprocal 
square root or square root with high performance. The general 
Newton-Raphson reciprocal square root recurrence is: 

$X_{i+1} = \frac{1}{2} \cdot X_i \cdot (3 - b \cdot X_i^2)$

To reduce the number of iterations, X_0 is an initial
approximation read from a table. The 3D reciprocal square root
approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an
input operand b, one Newton-Raphson iteration is required,
using the following sequence of 3D instructions:

X_0 = PFRSQRT(b)

X_1 = PFMUL(X_0, X_0)

X_2 = PFRSQIT1(b, X_1)

X_3 = PFRCPIT2(X_2, X_0)

X_4 = PFMUL(b, X_3)

The 24-bit final reciprocal square root value is X_3. In the
AMD-K6 3D processor 3D implementation, the estimate
contains the correct round-to-nearest value for approximately
87% of all arguments. The remaining arguments differ from the
correct round-to-nearest value by 1 ulp. The square root (X_4) is
formed in the last step by multiplying by the input operand b.

Optimized 15-Bit Precision Square Root

This square root operation can be executed in only 4 cycles,
assuming a programmer is able to hide the latency of the first
MOVD instruction within previous code. The reciprocal square
root operation requires two fewer cycles than the square root
operation.

MOVD      MM0, [mem]     ; 0 | a
PFRSQRT   MM1, MM0       ; 1/sqrt(a) | 1/sqrt(a)
PFMUL     MM0, MM1       ; sqrt(a) | sqrt(a)

Optimized 24-Bit Precision Square Root

This square root operation can be executed in only 10 cycles,
assuming a programmer is able to hide the latency of the first
MOVD instruction within previous code. The reciprocal square
root operation requires two fewer cycles than the square root
operation.

MOVD      MM0, [mem]     ; 0 | a
PFRSQRT   MM1, MM0       ; 1/sqrt(a) | 1/sqrt(a)
MOVQ      MM2, MM1
PFMUL     MM1, MM1
PFRSQIT1  MM1, MM0
PFRCPIT2  MM1, MM2       ; 1/sqrt(a) | 1/sqrt(a)
PFMUL     MM0, MM1       ; sqrt(a) | sqrt(a)
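
The reciprocal square root recurrence can be checked the same way in plain C. In this sketch the seed x0 is passed in by the caller and stands in for the PFRSQRT table estimate; the function names are hypothetical, and the arithmetic is ordinary float math rather than an emulation of PFRSQIT1 and PFRCPIT2.

```c
/* One Newton-Raphson step for the reciprocal square root:
   x1 = 1/2 * x0 * (3 - b * x0^2). */
float refine_rsqrt(float b, float x0)
{
    return 0.5f * x0 * (3.0f - b * x0 * x0);
}

/* Square root formed as in the last step of the sequence:
   sqrt(b) = b * (1 / sqrt(b)). */
float sqrt_from_rsqrt(float b, float x0)
{
    return b * refine_rsqrt(b, x0);
}
```

For b = 4 and a seed of 0.5001 (an error of about 2 parts in 10,000), one step already brings the reciprocal square root to within about one part in ten million of 0.5.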

x87 Floating-Point Coding Optimizations

This section describes x87 floating-point code optimization
techniques specific to the AMD-K6 3D processor.
For optimal floating-point performance — Wherever possible, 
use the packed single-precision, floating-point capability of 3D 
technology instead of the single-precision, double-precision,
and extended-precision floating-point capabilities of the x87 
floating-point unit. The 3D units are fully pipelined, allow 
vectorized optimizations, are not stack based, and provide 
faster inverse, square root, and inverse square root calculations. 

Avoid vector decoded floating-point instructions — Most 
floating-point instructions are short decodeable. A few of the 
less common instructions are vector decoded. In addition, if a
short decodeable instruction straddles a cache line, it becomes
vector decoded. This adds unnecessary overhead that can be
avoided by inserting NOPs in strategic locations within the 
code. 



Pair floating-point with short-decodeable instructions — Most 
floating-point instructions (also known as ESC instructions) are 
short-decodeable and are limited to the first decoder. The 
short-decodeable floating-point instructions can be paired with 
other short-decodeable instructions. This technique requires 
that floating-point instructions be arranged as the first of a pair 
of short-decodeable instructions. 

Avoid FXCH usage — Pairing FXCH with other floating-point 
instructions does not increase performance. 

Minimize switching between MMX or 3D instructions and FPU 
instructions— Because the MMX/3D registers are mapped onto 
the floating-point register stack, the EMMS or FEMMS 
instruction must be executed after MMX or 3D code and prior 
to the use of the floating-point unit. Group or partition MMX 
and 3D code away from FPU code so that the use of the EMMS 
or FEMMS instructions is minimized. In addition, the actual 
penalty or switch overhead from the use of the EMMS or 
FEMMS instructions occurs not at the time of their execution, 
but when and if the first floating-point instruction is 
encountered.

Avoid using MMX/3D registers (and MOVQ instructions) to
move blocks of double-precision floating-point data in 
memory — Although using 64-bit MOVQ instructions to move 
floating-point data appears fast, using MMX/3D registers 
requires the use of the EMMS or FEMMS instruction and incurs 
switch overhead when switching between these MMX or 3D 
instructions and surrounding floating-point instructions. 

Exploit parallel execution of integer and floating-point 
multiplies— The AMD-K6 3D processor allows simultaneous 
integer and floating-point multiplies using separate, 
low-latency multipliers. 

Do not split floating-point instructions with integer 
instructions — No penalty is incurred when using arithmetic or 
comparison floating-point instructions that use integer 
operands, such as the FIADD instruction or FICOM instruction. 
Splitting these instructions into discrete load and floating-point 
instructions decreases performance. 

Replace FDIV instructions with FMUL where possible— The 
FMUL instruction latency is much less than the FDIV 
instruction. If possible, replace floating-point divisions with 
floating-point multiplication of the reciprocal. 
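
As a minimal C illustration of this substitution (the function name is hypothetical), a loop that divides every element by the same value can compute the reciprocal once and multiply inside the loop. Note that the result can differ from a true division in the last bit, so the rewrite is only appropriate where that is acceptable.

```c
#include <stddef.h>

/* Divides every element by d using one division and n multiplies. */
void scale_by_reciprocal(float *v, size_t n, float d)
{
    const float inv = 1.0f / d;      /* the only divide */
    for (size_t i = 0; i < n; i++) {
        v[i] *= inv;                 /* multiply replaces divide */
    }
}
```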

Use integer instructions to move floating-point data — A 

floating-point load and store instruction pair requires a
minimum of four cycles to complete (two-cycle latency for each
instruction). The AMD-K6 3D processor can perform one
integer load and one store per cycle. Therefore, moving
single-precision data requires one cycle, moving
double-precision data requires two cycles, and moving
extended-precision data only requires three cycles when using
integer loads and stores. The following example shows how to
translate the C-style code when moving double-precision
floating-point data:

double temp1, temp2;
temp2 = temp1;

FLD  QWORD PTR [temp1];      Use:    MOV EAX, [temp1];
FSTP QWORD PTR [temp2];              MOV [temp2], EAX;
                                     MOV EAX, [temp1+4];
                                     MOV [temp2+4], EAX;
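
At the C level, the same idea can be written as a copy of the 64-bit value through two 32-bit integers. This is only a sketch: the function name is hypothetical, it assumes an 8-byte double, and a compiler is free to choose its own instructions for it (often a single 64-bit move), so it illustrates the intent rather than guaranteeing the MOV sequence.

```c
#include <stdint.h>
#include <string.h>

/* Copies a double as two 32-bit integer halves instead of going
   through the floating-point unit. memcpy keeps the accesses
   well defined in C. */
void copy_double_as_ints(double *dst, const double *src)
{
    uint32_t halves[2];
    memcpy(halves, src, sizeof halves);   /* two 32-bit loads  */
    memcpy(dst, halves, sizeof halves);   /* two 32-bit stores */
}
```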

Floating-Point Code Sample



Scheduling of floating-point instructions is unnecessary — The 
AMD-K6 3D processor has a low-latency, non-pipelined 
floating-point execution unit. 

Use load-execute floating-point instructions — The use of a 
load-execute instruction (such as FADD DWORD PTR [mem])
is preferable to the use of a load floating-point instruction 
followed by a floating-point reg, reg instruction. For the 
AMD-K6 3D processor, load-execute arithmetic and compare 
instructions are identical in throughput to floating-point reg, 
reg instructions. Because common floating-point instructions 
execute in two cycles each and the floating-point unit is not 
pipelined, code executes more efficiently if the minimum 
possible number of floating-point instructions are generated. 

The following code sample uses three of the most important 
rules to optimize this matrix multiply routine. The first rule 
used is avoidance of the [ESI] addressing mode. The routine 
forces this code to be [ESI+0]. The second rule is the insertion
of NOPs to avoid cache-line straddles. The third rule used is 
avoidance of vector decoded instructions. 



MATMUL MACRO
    db    0d9h, 046h, 00h        ;; FLD DWORD PTR [ESI+00] ; x
    FMUL  DWORD PTR [EBX]        ;; a11*x
    FLD   DWORD PTR [ESI+4]      ;; y
    FMUL  DWORD PTR [EBX+4]      ;; a21*y
    FLD   DWORD PTR [ESI+8]      ;; z
    FMUL  DWORD PTR [EBX+8]      ;; a31*z
    FLD   DWORD PTR [ESI+12]     ;; w
    FMUL  DWORD PTR [EBX+12]     ;; a41*w
    FADDP ST(3), ST              ;; a41*w+a31*z
    FADDP ST(2), ST              ;; a41*w+a31*z+a21*y
    FADDP ST(1), ST              ;; a41*w+a31*z+a21*y+a11*x
    FSTP  DWORD PTR [EDI]        ;; store rx
    NOP                          ;; make sure it does not
                                 ;; straddle across a cache line
    db    0d9h, 046h, 00h        ;; FLD DWORD PTR [ESI+00] ; x
    FMUL  DWORD PTR [EBX+16]     ;; a12*x
    FLD   DWORD PTR [ESI+4]      ;; y
    FMUL  DWORD PTR [EBX+20]     ;; a22*y
    FLD   DWORD PTR [ESI+8]      ;; z
    NOP                          ;; make sure it does not
                                 ;; straddle across a cache line
    FMUL  DWORD PTR [EBX+24]     ;; a32*z
    FLD   DWORD PTR [ESI+12]     ;; w
    FMUL  DWORD PTR [EBX+28]     ;; a42*w
    FADDP ST(3), ST              ;; a42*w+a32*z
    FADDP ST(2), ST              ;; a42*w+a32*z+a22*y
    FADDP ST(1), ST              ;; a42*w+a32*z+a22*y+a12*x
    FSTP  DWORD PTR [EDI+4]      ;; store ry
    db    0d9h, 046h, 00h        ;; FLD DWORD PTR [ESI+00] ; x
    FMUL  DWORD PTR [EBX+32]     ;; a13*x
    FLD   DWORD PTR [ESI+4]      ;; y
    FMUL  DWORD PTR [EBX+36]     ;; a23*y
    NOP                          ;; make sure it does not
                                 ;; straddle across a cache line
    FLD   DWORD PTR [ESI+8]      ;; z
    FMUL  DWORD PTR [EBX+40]     ;; a33*z
    FLD   DWORD PTR [ESI+12]     ;; w
    FMUL  DWORD PTR [EBX+44]     ;; a43*w
    FADDP ST(3), ST              ;; a43*w+a33*z
    FADDP ST(2), ST              ;; a43*w+a33*z+a23*y
    FADDP ST(1), ST              ;; a43*w+a33*z+a23*y+a13*x
    FSTP  DWORD PTR [EDI+8]      ;; store rz
    db    0d9h, 046h, 00h        ;; FLD DWORD PTR [ESI+00] ; x
    FMUL  DWORD PTR [EBX+48]     ;; a14*x
    FLD   DWORD PTR [ESI+4]      ;; y
    FMUL  DWORD PTR [EBX+52]     ;; a24*y
    FLD   DWORD PTR [ESI+8]      ;; z
    FMUL  DWORD PTR [EBX+56]     ;; a34*z
    FLD   DWORD PTR [ESI+12]     ;; w
    FMUL  DWORD PTR [EBX+60]     ;; a44*w
    FADDP ST(3), ST              ;; a44*w+a34*z
    NOP                          ;; make sure it does not
                                 ;; straddle across a cache line
    FADDP ST(2), ST              ;; a44*w+a34*z+a24*y
    NOP                          ;; make sure it does not
                                 ;; straddle across a cache line
    FADDP ST(1), ST              ;; a44*w+a34*z+a24*y+a14*x
    FSTP  DWORD PTR [EDI+12]     ;; store rw
ENDM



AMD Processor Recognition



Introduction 



Due to the increasing number of choices available in the x86 
processor marketplace, the need for a simple way for hardware 
and software to identify the type of processor and its feature set 
has become critical. The CPUID instruction was added to the 
x86 instruction set for this purpose. 

The CPUID instruction provides complete information about 
the processor (vendor, type, name, etc.) and its capabilities 
(features). After detecting the processor and its capabilities, 
software can be accurately tuned to the system for maximum 
performance and benefit to users. For example, game software 
can test the performance level available from a particular 
processor by detecting the type or speed of the processor. If the 
performance level is high enough, the software can enable 
additional capabilities or more advanced algorithms. Another 
example involves testing for the presence of MMX and 3D 
instructions on the processor. If the software finds these features 
present when it checks the feature bits, it can use these more 
powerful extensions for dramatically better performance in 
new multimedia software. 
Using the CPUID Instruction

Overview 

Software operating at any privilege level can execute the 
CPUID instruction to identify the processor and its feature set. 
In addition, the CPUID instruction implements multiple 
functions, each providing different information about the 
processor, including the vendor, model number, revision 
(stepping), features, cache organization, and processor name. 
The multiple-function approach allows the CPUID instruction 
to return a complete picture about the type of processor and its 
capabilities — more detailed information than could be 
returned by a single function. In addition to gathering all the 
information by calling multiple functions, the CPUID 
instruction provides the flexibility of making only one call to 
obtain the specific data requested once the processor vendor 
has been identified. 

The functions are divided into two types: standard functions 
and extended functions. Standard functions provide a simple 
method for software to access information common to all x86 
processors. Extended functions provide information on 
extensions specific to a vendor's processor (for example, AMD's 
processors). 

The flexibility of the CPUID instruction allows for the addition 
of new CPUID functions in future generations of processors. 
See page 515 for a detailed description of the CPUID 
instruction. 

Testing for the CPUID Instruction 

Beginning with the Am486®DX4 processor, all AMD processors 
implement the CPUID instruction. In order to avoid an invalid 
opcode exception on those processors that do not support the 
CPUID instruction, software must first test to determine if the 
CPUID instruction is present on the processor. The presence of 
the CPUID instruction is indicated by the ID bit (21) in the 
EFLAGS register. If this bit is writeable, the CPUID instruction 
is implemented on the processor. 
Software uses the PUSHFD and POPFD instructions to write to 
the ID bit in the EFLAGS register. After reading the ID bit, a 
comparison determines if this operation changed the value of 
the ID bit. If the value changed, the CPUID instruction is 
available for identifying the processor and its features. The 
following code sample demonstrates the way a program uses the 
PUSHFD and POPFD instructions to test the ID bit. 



    pushfd                  ; Save EFLAGS to stack
    pop   eax               ; Store EFLAGS in EAX
    mov   ebx, eax          ; Save in EBX for testing later
    xor   eax, 00200000h    ; Switch bit 21
    push  eax               ; Copy "changed" value to stack
    popfd                   ; Save "changed" EAX to EFLAGS
    pushfd                  ; Push EFLAGS to top of stack
    pop   eax               ; Store EFLAGS in EAX
    cmp   eax, ebx          ; See if bit 21 has changed
    jz    NO_CPUID          ; If no change, no CPUID



Using CPUID Functions 

When software uses the CPUID instruction to identify a 
processor, it is important that it uses the instruction 
appropriately. The instruction has been defined to make it easy 
to identify the type and features of x86 processors 
manufactured by many different vendors. 

The standard functions (EAX=0 and EAX=1) are the same for 
all processors. Having standard functions simplifies software's 
task of testing for and implementing features common to x86 
processors. Software can test for these features and, as new x86 
processors are released, benefit from these capabilities 
immediately. 

Extended functions are specific to a vendor's processor. These 
functions provide additional information about AMD 
processors that software can use to identify enhanced features 
and functions. To test for extended functions, software checks 
for "AuthenticAMD" in the vendor identification string 
returned by function 0 and for a non-zero value in the EAX 
register returned by function 8000_0000h. 

Within AMD's family of processors, different members can 
execute a different number of functions. Table 92 on page 508 
summarizes the CPUID functions currently implemented on 
AMD processors. 
Table 92. Summary of CPUID Functions in AMD Processors

(See page 515 for detailed descriptions of the functions.)

Standard   Extended                                         AMD-K5      AMD-K5        AMD-K6        AMD-K6
Function   Function     Description                         Processor   Processor     Processor     Processor
                                                            (model 0)   (models 1,    (models 6,    (model 9)
                                                                        2, and 3)     7, and 8)
---------  -----------  ----------------------------------  ----------  ------------  ------------  ----------
0                       Vendor String and Largest           X           X             X             X
                        Standard Function Value
1                       Processor Signature and Standard    X           X             X             X
                        Feature Bits
           8000_0000h   Largest Extended Function Value                 X             X             X
           8000_0001h   Extended Processor Signature and                X             X             X
                        Extended Feature Bits
           8000_0002h   Processor Name                                  X             X             X
           8000_0003h   Processor Name                                  X             X             X
           8000_0004h   Processor Name                                  X             X             X
           8000_0005h   L1 Cache Information                            X             X             X
           8000_0006h   L2 Cache Information                                                        X

Note:
Future versions of these processors may implement additional functions.



Identifying the Processor's Vendor 

To identify the vendor, software must execute standard function 0 (EAX=0). The 
CPUID instruction returns a 12-character string that identifies 
the processor's vendor. The instruction also returns the largest 
standard function input value defined for the CPUID 
instruction on the processor. 

For AMD processors, function 0 returns a vendor string of 
"AuthenticAMD". This string informs the software to follow 
AMD's definition for subsequent CPUID functions and the 
registers returned for those functions. 

Once the software identifies the processor's vendor, it knows 
the definition for all the functions supplied by the CPUID 
instruction. By using these functions, the software obtains the 
processor information needed to properly tune its functionality 
to the capabilities of the processor. 
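
As a rough illustration of the vendor check described above, the following C sketch assembles the 12-character string from the three registers that function 0 returns. It does not execute CPUID itself; it assumes the caller has already captured EBX, EDX, and ECX. The function names vendor_string and is_amd are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unpack one 32-bit register into four ASCII bytes, least significant
 * byte first (little endian order). */
static void unpack_vendor_reg(uint32_t reg, char *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((reg >> (8 * i)) & 0xFFu);
}

/* Build the vendor string from the function 0 outputs. The register
 * order is EBX, EDX, ECX. `out` must hold at least 13 bytes. */
void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    assert(out != NULL);
    unpack_vendor_reg(ebx, out);
    unpack_vendor_reg(edx, out + 4);
    unpack_vendor_reg(ecx, out + 8);
    out[12] = '\0';
}

/* Returns nonzero when the vendor string is "AuthenticAMD". */
int is_amd(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char v[13];
    vendor_string(ebx, edx, ecx, v);
    return strcmp(v, "AuthenticAMD") == 0;
}
```

Passing the function 0 register values listed for AMD processors later in this chapter (EBX = 6874_7541h, EDX = 6974_6E65h, ECX = 444D_4163h) yields the string "AuthenticAMD".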
Determining the Processor Signature (Standard Function)

Standard function 1 (EAX=1) of the CPUID instruction returns 
the standard processor signature and feature bits. The standard 
processor signature is returned in the EAX register and 
provides information regarding the specific revision (stepping) 
and model of the processor and the instruction family level 
supported by the processor. The revision level is used to 
determine if the processor requires the implementation of 
software workarounds. Figure 119 shows the contents of the 
EAX register obtained by function 1. Table 93 on page 510 
summarizes the specific processor signature values returned for 
AMD processors. 




[Figure: EAX register fields, from high to low bits: Instruction Family, Model, Stepping]

Figure 119. Contents of EAX Register Returned by Function 1
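
The signature fields can be separated with shifts and masks. The C sketch below assumes the bit positions given in the CPUID instruction reference later in this chapter (stepping in EAX[3-0], model in EAX[7-4], instruction family in EAX[11-8]). The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the processor signature returned in EAX by function 1. */
struct cpu_signature {
    unsigned family;    /* EAX[11-8] */
    unsigned model;     /* EAX[7-4]  */
    unsigned stepping;  /* EAX[3-0]  */
};

/* Split a raw EAX value into its three fields. EAX[31-12] is reserved
 * and is ignored here. */
void decode_signature(uint32_t eax, struct cpu_signature *sig)
{
    assert(sig != NULL);
    sig->family   = (eax >> 8) & 0xFu;
    sig->model    = (eax >> 4) & 0xFu;
    sig->stepping = eax & 0xFu;
}
```

For example, a signature of 0000_058Ch decodes to family 5, model 8, stepping 0Ch.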
Table 93. Summary of Processor Signatures for AMD Processors

(See page 515 for details on bit locations and values.)

Processor                          Instruction Family   Model             Stepping ID
---------------------------------  -------------------  ----------------  ---------------
Am486 and Am5x86 Processors        0100b (4h)           yyyy (note 2)     xxxx (note 1)
AMD-K5 Processor (Model 0)         0101b (5h)           0000b (0h)        xxxx (note 1)
AMD-K5 Processor (Model 1)         0101b (5h)           0001b (1h)        xxxx (note 1)
AMD-K5 Processor (Model 2)         0101b (5h)           0010b (2h)        xxxx (note 1)
AMD-K5 Processor (Model 3)         0101b (5h)           0011b (3h)        xxxx (note 1)
AMD-K6 Processor (Model 6)         0101b (5h)           0110b (6h)        xxxx (note 1)
AMD-K6 Processor (Model 7)         0101b (5h)           0111b (7h)        xxxx (note 1)
AMD-K6 3D Processor (Model 8)      0101b (5h)           1000b (8h)        xxxx (note 1)
AMD-K6 3D+ Processor (Model 9)     0101b (5h)           1001b (9h)        xxxx (note 1)

Notes:

1. Contact your AMD representative for the latest stepping information.

2. Model identifier information is provided in the AMD BIOS Development Guide, document 
number 19720.



Identifying Supported Features 

The feature bits are returned in the EDX register for two 
CPUID functions — standard function 1 and extended function 
8000_0001h. Each bit corresponds to a specific feature and 
indicates if that feature is present on the processor. Table 94 on 
page 511 summarizes the standard feature bits, and Table 95 on 
page 512 summarizes the extended feature bits. 

Before using any of the enhanced features added to the latest 
generation of processors, software should test each feature bit 
returned by functions 1 and 8000_0001h to identify the 
capabilities available on the processor. For example, software 
must test bit 23 to determine if the processor executes MMX 
instructions. Attempting to execute an unavailable feature can 
cause errors and exceptions. 
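
A feature-bit test reduces to one shift and mask. The C sketch below is illustrative only: it takes an EDX value that the caller has already read from function 1 (or 8000_0001h) and reports whether a given bit is set. The enumerator names are hypothetical, and the bit numbers are the ones listed in the feature flag tables of this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* A few standard feature bit positions in EDX (function 1). */
enum {
    FEATURE_FPU = 0,   /* floating-point unit */
    FEATURE_TSC = 4,   /* time stamp counter  */
    FEATURE_MMX = 23   /* MMX instructions    */
};

/* Returns 1 if the feature bit is set in the EDX value, otherwise 0. */
int has_feature(uint32_t edx, unsigned bit)
{
    assert(bit < 32);
    return (int)((edx >> bit) & 1u);
}
```

With the EDX value 0080_01BFh, has_feature(edx, FEATURE_MMX) reports that MMX instructions are available, while bit 9 reads as clear.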

Bit 31, as returned by extended function 8000_0001h, 
designates the presence of 3D technology. Other processor 
vendors have adopted this technology, so bit 31 is now 
considered an open standard. An alternate way to test for the 
presence of 3D technology (as opposed to testing for 
AuthenticAMD) is for software to implement the following 
algorithm: 

1. Test for the CPUID instruction. (See "Testing for the CPUID 
Instruction" on page 506.) 

2. Execute the CPUID extended function 8000_0000h. 

3. Test if the value returned in the EAX register is greater 
than or equal to 8000_0000h. 

4. Execute the CPUID extended function 8000_0001h. 

5. Test bit 31 in the EDX register for 3D technology. 
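
Steps 2 through 5 of this algorithm can be sketched in C as a check on values that software has already read with CPUID. The structure cpuid_snapshot and the function supports_3d are hypothetical names. Step 1 is assumed to have been done by the caller. As a deliberate assumption, the sketch requires the reported maximum to reach 8000_0001h before it trusts that function's EDX value, which is a slightly stricter reading of step 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values captured by the caller after CPUID is known to be present. */
struct cpuid_snapshot {
    uint32_t max_ext_function;  /* EAX returned by function 8000_0000h */
    uint32_t ext_feature_edx;   /* EDX returned by function 8000_0001h;
                                   only meaningful if that function exists */
};

/* Returns 1 if 3D technology is reported, otherwise 0. */
int supports_3d(const struct cpuid_snapshot *s)
{
    assert(s != NULL);
    /* Step 3 (assumed stricter form): function 8000_0001h must exist. */
    if (s->max_ext_function < 0x80000001u)
        return 0;
    /* Step 5: bit 31 of the extended feature flags. */
    return (int)((s->ext_feature_edx >> 31) & 1u);
}
```

Real code would skip the 8000_0001h call entirely when the first check fails; the snapshot form is used here only so the logic can be exercised without running CPUID.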



Table 94. Summary of Standard Feature Bits for AMD Processors

(See page 515 for details on bit locations and values.)

Feature                               Description
------------------------------------  --------------------------------------------------
Floating-Point Unit                   A floating-point unit is available.
Virtual Mode Extensions               Virtual mode extensions are available.
Debugging Extensions                  I/O breakpoint debug extensions are supported.
Page Size Extensions                  4-Mbyte pages are supported.
Time Stamp Counter                    A time stamp counter is available in the processor,
(with RDTSC and CR4 disable bit)      and the RDTSC instruction is supported.
K86 Model-Specific Registers          The K86 model-specific registers are available in
(with RDMSR and WRMSR)                the processor, and the RDMSR and WRMSR
                                      instructions are supported.
Machine Check Exception               The machine check exception is supported.
CMPXCHG8B Instruction                 The CMPXCHG8B instruction is supported.
APIC                                  A local APIC unit is available.
Global Paging Extension               Global paging extensions are available.
Conditional Move Instructions         The conditional move instructions CMOV, FCMOV,
                                      and FCOMI are supported.
MMX Instructions                      MMX instructions are supported.
Table 95. Summary of Extended Feature Bits for AMD Processors

(See page 515 for details on bit locations and values.)

Feature                               Description
------------------------------------  --------------------------------------------------
Floating-Point Unit                   A floating-point unit is available.
Virtual Mode Extensions               Virtual mode extensions are available.
Debugging Extensions                  I/O breakpoint debug extensions are supported.
Page Size Extensions                  4-Mbyte pages are supported.
Time Stamp Counter                    A time stamp counter is available in the processor,
(with RDTSC and CR4 disable bit)      and the RDTSC instruction is supported.
K86 Model-Specific Registers          The K86 model-specific registers are available in
(with RDMSR and WRMSR)                the processor, and the RDMSR and WRMSR
                                      instructions are supported.
Machine Check Exception               The machine check exception is supported.
CMPXCHG8B Instruction                 The CMPXCHG8B instruction is supported.
Global Paging Extension               Global paging extensions are available.
SYSCALL and SYSRET Instructions       The SYSCALL and SYSRET instructions and
                                      associated extensions are supported.
Integer Conditional Move              The integer conditional move instruction CMOV is
Instruction                           supported.
Floating-Point Conditional Move       The floating-point conditional move instructions
Instructions                          FCMOV and FCOMI are supported.
MMX Instructions                      MMX instructions are supported.
3D Instructions                       3D instructions are supported.
Testing For Extended Functions

Once software has identified the processor's vendor as AMD, it 
must test for extended functions by executing function 
8000_0000h. The EAX register returns the largest extended 
function input value defined for the CPUID instruction on the 
processor. If this value is non-zero, extended functions are 
supported. 

To simplify identifying processors and their features, the AMD 
extended functions include all the information provided in the 
standard functions as well as the additional AMD-specific 
feature enhancements. This duplication can minimize the 
number of function calls required by software. For more 
information, see 3D Feature Detection on page 83 and MMX 
Feature Detection on page 355.
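
The availability test can be expressed as a small predicate. This C sketch assumes the caller has already determined whether the vendor string is "AuthenticAMD" and has read EAX from function 8000_0000h. Treating any function number up to the reported maximum as supported is an inference from the description above, not a documented guarantee, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the given extended function may be called, otherwise 0.
 * vendor_is_amd:    nonzero if function 0 returned "AuthenticAMD"
 * max_ext_function: EAX returned by function 8000_0000h
 * function:         extended function number to check */
int ext_function_supported(int vendor_is_amd,
                           uint32_t max_ext_function,
                           uint32_t function)
{
    assert(function >= 0x80000000u);  /* must lie in the extended space */
    if (!vendor_is_amd || max_ext_function == 0)
        return 0;
    return function <= max_ext_function;
}
```

For a processor that reports 8000_0005h as its largest extended function, the predicate accepts 8000_0005h and rejects 8000_0006h.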



Determining the Processor Signature (Extended Function)

Extended function 8000_0001h returns the AMD processor 
signature. The signature is returned in the EAX register and 
provides generation, model, and stepping information for AMD 
processors. Figure 120 shows the contents returned in the EAX 
register. 



Bits 11-8             Bits 7-4    Bits 3-0
Generation/Family     Model       Stepping

Figure 120. Contents of EAX Register Returned by Extended Function 8000_0001h
Displaying the Processor's Name

Functions 8000_0002h, 8000_0003h, and 8000_0004h return an 
ASCII string containing the name of the processor. These 
functions eliminate the need for software to search for the 
processor name in a lookup table, a process requiring a large 
block of memory and frequent updates. Instead, software can 
simply call these three functions to obtain the name string (48 
ASCII characters in little endian format) and display it on the 
screen. Although the name string can be up to 48 characters in 
length, shorter names have the remaining byte locations filled 
with the ASCII NULL character (00h). To simplify the display 
routines and avoid using screen space, software only needs to 
display characters until a NULL character is detected. 
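
A minimal C sketch of the name assembly follows. It assumes the caller has collected the twelve register values in call order (functions 8000_0002h, 8000_0003h, and 8000_0004h, each as EAX, EBX, ECX, EDX). The function name processor_name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assemble the processor name from 12 register values.
 * `out` must hold at least 49 bytes (48 characters plus terminator). */
void processor_name(const uint32_t regs[12], char out[49])
{
    assert(regs != NULL && out != NULL);
    memset(out, 0, 49);
    for (size_t r = 0; r < 12; r++) {
        /* The least significant byte of each register comes first. */
        for (size_t b = 0; b < 4; b++)
            out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    }
    out[48] = '\0';  /* terminator for names that use all 48 bytes */
}
```

Because unused bytes are NULL, the result can be printed directly as a C string. The values 2D44_4D41h, 7428_364Bh, 3320_296Dh, 7270_2044h, 7365_636Fh, and 0072_6F73h, followed by zeros, decode to "AMD-K6(tm) 3D processor".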



Displaying Cache Information

Functions 8000_0005h and 8000_0006h provide cache 
information for the processor (function 8000_0006h is 
supported only on the AMD-K6® 3D+ processor, Model 9). Some 
diagnostic software displays information about the system and 
the processor's configuration, and it is common for this type of 
software to report cache size and organization. Functions 
8000_0005h and 8000_0006h provide a simple way for software 
to obtain information about the on-chip cache and TLB 
structures. The size and organization information is returned in 
the registers as described on page 515. Software can simply 
display these values, eliminating the need for large pieces of 
code to test the memory structures. 



Sample Code 



A code sample that uses the CPUID instruction to identify the 
processor and its features is available from AMD's website at 
http://www.amd.com/k6/k6docs/. 
CPUID

mnemonic    opcode    description
CPUID       0F A2h    Identify the processor and its feature set

Privilege:             none
Registers Affected:    EAX, EBX, ECX, EDX
Flags Affected:        none
Exceptions Generated:  none

The CPUID instruction is an application-level instruction that software executes to 
identify the processor and its feature set. This instruction offers multiple functions, 
each providing a different set of information about the processor. The CPUID 
instruction can be executed from any privilege level. Software can use the information 
returned by this instruction to tune its functionality for the specific processor and its 
features. 

Not all processors implement the CPUID instruction. Therefore, software must test to 
determine if the instruction is present on the processor. If the ID bit (21) in the 
EFLAGS register is writeable, the CPUID instruction is implemented. 

The CPUID instruction supports multiple functions. The information associated with 
each function is obtained by executing the CPUID instruction with the function 
number in the EAX register. Functions are divided into two types: standard functions 
and extended functions. Standard functions are found in the low function space, 
0000_0000h-7FFF_FFFFh. In general, all x86 processors have the same standard 
function definitions. 

Extended functions are defined specifically for processors supplied by the vendor 
listed in the vendor identification string. Extended functions are found in the high 
function space, 8000_0000h-8FFF_FFFFh. Because not all vendors have defined 
extended functions, software must test for their presence on the processor. 

AMD processors have extended functions under the following conditions: 

■ The processor returns the "AuthenticAMD" vendor identification string. 

■ The 8000_0000h function returns a non-zero value in the EAX register. 
Standard Functions

Function 0 - Largest Standard Function Input Value and Vendor Identification String

Input:   EAX = 0

Output:  EAX = Largest function input value recognized by the CPUID instruction
         EBX, EDX, ECX = Vendor identification string

This is a standard function found in all processors implementing the CPUID 
instruction. It returns two values. The first value is returned in the EAX register and 
indicates the largest standard function value recognized by the processor. The second 
value is the vendor identification string. This 12-character ASCII string is returned in 
the EBX, EDX, and ECX registers in little endian format. 

AMD processors return a vendor identification string of "AuthenticAMD" that 
software uses as follows: 

■ To identify the processor as an AMD processor 

■ To apply AMD's definition of the CPUID instruction for all additional function 
calls 



Function 1 - Processor Signature and Standard Feature Flags 

Input: EAX = 1 

Output: EAX = Processor Signature 
EBX = Reserved 
ECX = Reserved 
EDX = Standard Feature Flags 

Function 1 returns two values — the Processor Signature and the Standard Feature 
Flags. The processor signature is returned in the EAX register and identifies the 
specific processor by providing information on its type — instruction family, model, 
and revision (stepping). The information is formatted as follows: 

■ EAX[3-0] Stepping ID 

■ EAX[7-4] Model 

■ EAX[11-8] Instruction Family 

■ EAX[31-12] Reserved 

The standard feature flags are returned in the EDX register and indicate the presence 
of specific features. In most cases, a "1" indicates the feature is present, and a "0" 
indicates the feature is not present. Table 96 on page 517 contains a list of the 
currently defined standard feature flags. Reserved bits will be used for new features 
as they are added. 
Table 96. Standard Feature Flag Descriptions

Bit     Feature                                                Description
------  -----------------------------------------------------  ----------------------------------------
0       Floating-Point Unit                                    0 = No FPU, 1 = FPU Present
1       Virtual Mode Extensions                                0 = No Support, 1 = Support
2       Debugging Extensions                                   0 = No Support, 1 = Support
3       Page Size Extensions                                   0 = No Support, 1 = Support 4-Mbyte Pages
4       Time Stamp Counter (with RDTSC and CR4 disable bit)    0 = No Support, 1 = Support
5       K86 Model-Specific Registers (with RDMSR and WRMSR)    0 = No Support, 1 = Support
6       Reserved
7       Machine Check Exception                                0 = No Support, 1 = Support
8       CMPXCHG8B Instruction                                  0 = No Support, 1 = Support
9       APIC*                                                  0 = No Support, 1 = Support
10-11   Reserved
12      Memory Type Range Registers                            0 = No Support, 1 = Support
13      Global Paging Extension*                               0 = No Support, 1 = Support
14      Reserved
15      Conditional Move Instruction                           0 = No Support, 1 = Support
16-22   Reserved
23      MMX Instructions                                       0 = No Support, 1 = Support
24-31   Reserved

Note:

* The AMD-K5 processor (model 0) reserves bit 13 and implements feature bit 9 to indicate support for Global Paging Extensions 
instead of support for APIC.
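
The note on the AMD-K5 processor (model 0) means that a Global Paging Extension test has to consult the processor signature first. The C sketch below shows one possible way to handle this. It assumes the vendor has already been identified as AMD and that the caller passes the four registers returned by function 1 in the order EAX, EBX, ECX, EDX. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* fn1 holds EAX, EBX, ECX, EDX as returned by standard function 1 on
 * an AMD processor. Returns 1 if global paging extensions are
 * reported, otherwise 0. */
int has_global_paging(const uint32_t fn1[4])
{
    assert(fn1 != NULL);
    unsigned family = (fn1[0] >> 8) & 0xFu;  /* EAX[11-8] */
    unsigned model  = (fn1[0] >> 4) & 0xFu;  /* EAX[7-4]  */

    /* AMD-K5 model 0 reports the feature in bit 9; all other
     * processors covered by the table use bit 13. */
    unsigned bit = (family == 5 && model == 0) ? 9u : 13u;
    return (int)((fn1[3] >> bit) & 1u);
}
```

On any other model, bit 9 keeps its tabulated meaning (APIC) and is not examined by this function.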
Extended Functions

Function 8000_0000h - Largest Extended Function Input Value

Input:   EAX = 8000_0000h

Output:  EAX = Largest function input value recognized by the CPUID instruction
         EBX = Reserved
         ECX = Reserved
         EDX = Reserved

Function 8000_0000h returns a value in the EAX register that indicates the largest 
extended function value recognized by the processor. 

Function 8000_0001h - AMD Processor Signature and Extended Feature Flags

Input:   EAX = 8000_0001h

Output:  EAX = AMD Processor Signature
EBX = Reserved 
ECX = Reserved 
EDX = Extended Feature Flags 

Function 8000_0001h returns two values— the AMD Processor Signature and the 
Extended Feature Flags. The AMD processor signature is returned in the EAX 
register and identifies the specific processor by providing information regarding its 
type — generation/family, model, and revision (stepping). The information is 
formatted as follows: 

■ EAX[3-0] Stepping ID 

■ EAX[7-4] Model 

■ EAX[11-8] Generation/Family 

■ EAX[31-12] Reserved 

The extended feature flags are returned in the EDX register and indicate the 
presence of specific features found in AMD processors. In most cases, a "1" indicates 
the feature is present, and a "0" indicates the feature is not present. Table 97 on 
page 519 contains a list of the currently defined extended feature flags. Reserved bits 
will be used for new features as they are added. 
Table 97. Extended Feature Flag Descriptions

Bit     Feature                                                Description
------  -----------------------------------------------------  ----------------------------------------
0       Floating-Point Unit                                    0 = No FPU, 1 = FPU Present
1       Virtual Mode Extensions                                0 = No Support, 1 = Support
2       Debugging Extensions                                   0 = No Support, 1 = Support
3       Page Size Extensions                                   0 = No Support, 1 = Support 4-Mbyte Pages
4       Time Stamp Counter (with RDTSC and CR4 disable bit)    0 = No Support, 1 = Support
5       K86 Model-Specific Registers (with RDMSR and WRMSR)    0 = No Support, 1 = Support
6       Reserved
7       Machine Check Exception                                0 = No Support, 1 = Support
8       CMPXCHG8B Instruction                                  0 = No Support, 1 = Support
9-10    Reserved
11      SYSCALL and SYSRET Instructions                        0 = No Support, 1 = Support
12      Reserved
13      Global Paging Extension                                0 = No Support, 1 = Support
14      Reserved
15      Integer Conditional Move Instruction                   0 = No Support, 1 = Support
16      Floating-Point Conditional Move Instructions           0 = No Support, 1 = Support
17-22   Reserved
23      MMX Instructions                                       0 = No Support, 1 = Support
24-30   Reserved
31      3D Instructions                                        0 = No Support, 1 = Support
Functions 8000_0002h, 8000_0003h, and 8000_0004h - Processor Name String

Input:   EAX = 8000_0002h, 8000_0003h, or 8000_0004h

Output:  EAX = Processor Name String
         EBX = Processor Name String
         ECX = Processor Name String
         EDX = Processor Name String



Functions 8000_0002h, 8000_0003h, and 8000_0004h each return part of the processor 
name string in the EAX, EBX, ECX, and EDX registers. These three functions use the 
four registers to return an ASCII string of up to 48 characters in little endian format. 
For example, function 8000_0002h returns the first 16 characters of the processor 
name. The first character resides in the least significant byte of EAX, and the last 
character (of this group of 16) resides in the most significant byte of EDX. The NULL 
character (ASCII 00h) is used to indicate the end of the processor name string. This 
feature is useful for processor names that require fewer than 48 characters. 

Function 8000_0005h - L1 Cache Information

Input:   EAX = 8000_0005h

Output:  EAX = Reserved
         EBX = TLB Information
         ECX = L1 Data Cache Information
         EDX = L1 Instruction Cache Information

Function 8000_0005h returns information about the processor's on-chip L1 caches and 
associated TLBs. Tables 98, 99, and 100 on page 521 provide the format for the 
information returned by the 8000_0005h function. 
Table 98. EBX Format Returned by Function 8000_0005h

        Data TLB                         Instruction TLB
        Associativity*   # Entries       Associativity*   # Entries
EBX     Bits 31-24       Bits 23-16      Bits 15-8        Bits 7-0

Note:

* Full associativity is indicated by a value of 0FFh.


Table 99. ECX Format Returned by Function 8000_0005h

        L1 Data Cache
        Size (Kbytes)    Associativity*  Lines per Tag    Line Size (bytes)
ECX     Bits 31-24       Bits 23-16      Bits 15-8        Bits 7-0

Note:

* Full associativity is indicated by a value of 0FFh.


Table 100. EDX Format Returned by Function 8000_0005h

        L1 Instruction Cache
        Size (Kbytes)    Associativity*  Lines per Tag    Line Size (bytes)
EDX     Bits 31-24       Bits 23-16      Bits 15-8        Bits 7-0

Note:

* Full associativity is indicated by a value of 0FFh.
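
The three register formats above can be unpacked with byte-wide shifts and masks. The following C sketch is an illustration, with hypothetical structure and function names. It assumes the caller supplies EBX, ECX, or EDX as read from function 8000_0005h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct l1_cache_info {
    unsigned size_kbytes;      /* bits 31-24 */
    unsigned associativity;    /* bits 23-16, 0xFF = fully associative */
    unsigned lines_per_tag;    /* bits 15-8  */
    unsigned line_size_bytes;  /* bits 7-0   */
};

struct tlb_info {
    unsigned associativity;    /* 0xFF = fully associative */
    unsigned entries;
};

/* Decode ECX (L1 data cache) or EDX (L1 instruction cache). */
void decode_l1_cache(uint32_t reg, struct l1_cache_info *out)
{
    assert(out != NULL);
    out->size_kbytes     = (reg >> 24) & 0xFFu;
    out->associativity   = (reg >> 16) & 0xFFu;
    out->lines_per_tag   = (reg >> 8) & 0xFFu;
    out->line_size_bytes = reg & 0xFFu;
}

/* Decode EBX: the data TLB is in the upper half, the instruction TLB
 * in the lower half. */
void decode_tlbs(uint32_t ebx, struct tlb_info *data_tlb,
                 struct tlb_info *instr_tlb)
{
    assert(data_tlb != NULL && instr_tlb != NULL);
    data_tlb->associativity  = (ebx >> 24) & 0xFFu;
    data_tlb->entries        = (ebx >> 16) & 0xFFu;
    instr_tlb->associativity = (ebx >> 8) & 0xFFu;
    instr_tlb->entries       = ebx & 0xFFu;
}
```

For example, a cache register value of 2002_0220h decodes to a 32-Kbyte, two-way cache with two lines per tag and 32-byte lines, and an EBX value of 0280_0140h decodes to a two-way, 128-entry data TLB and a direct-mapped, 64-entry instruction TLB.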




Function 8000_0006h - L2 Cache Information

This function is available on the AMD-K6 3D+ processor Model 9.

Input:   EAX = 8000_0006h

Output:  EAX = Reserved
         EBX = Reserved
         ECX = L2 Unified Cache Information
         EDX = Reserved

Function 8000_0006h returns information about the processor's L2 cache. Table 101 
provides the format for the information returned by the 8000_0006h function. 



Table 101. ECX Format Returned by Function 8000_0006h

        L2 Cache
        Size (Kbytes)    Associativity*  Lines per Tag    Line Size (bytes)
ECX     Bits 31-16       Bits 15-12      Bits 11-8        Bits 7-0

Note:

* Full associativity is indicated by a value of 0Fh.
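
The L2 format uses different field widths from the L1 registers, so it needs its own decoder. A minimal C sketch, assuming the caller has confirmed that function 8000_0006h is supported and passes the ECX value it returned (names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct l2_cache_info {
    unsigned size_kbytes;      /* bits 31-16 */
    unsigned associativity;    /* bits 15-12, 0xF = fully associative */
    unsigned lines_per_tag;    /* bits 11-8  */
    unsigned line_size_bytes;  /* bits 7-0   */
};

/* Decode ECX as returned by function 8000_0006h. */
void decode_l2_cache(uint32_t ecx, struct l2_cache_info *out)
{
    assert(out != NULL);
    out->size_kbytes     = (ecx >> 16) & 0xFFFFu;
    out->associativity   = (ecx >> 12) & 0xFu;
    out->lines_per_tag   = (ecx >> 8) & 0xFu;
    out->line_size_bytes = ecx & 0xFFu;
}
```

An ECX value of 0100_4220h decodes to a 256-Kbyte, four-way cache with two lines per tag and 32-byte lines.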
Values Returned by the CPUID Instruction

Table 102 contains the values returned by the CPUID 
instruction for AMD processors models 6 through 9. 

Table 102. Values Returned By AMD Processors

                         AMD-K6        AMD-K6        AMD-K6 3D     AMD-K6 3D+
                         Processor     Processor     Processor     Processor
Function     Register    (Model 6)     (Model 7)     (Model 8)     (Model 9)
-----------  ----------  ------------  ------------  ------------  ------------
0            EAX         0000_0001h    0000_0001h    0000_0001h    0000_0001h
             EBX         6874_7541h    6874_7541h    6874_7541h    6874_7541h
             ECX         444D_4163h    444D_4163h    444D_4163h    444D_4163h
             EDX         6974_6E65h    6974_6E65h    6974_6E65h    6974_6E65h

1            EAX         0000_056Xh    0000_057Xh    0000_058Xh    0000_059Xh
             EBX         Reserved      Reserved      Reserved      Reserved
             ECX         Reserved      Reserved      Reserved      Reserved
             EDX         0080_01BFh    0080_01BFh    0080_01BFh    0080_21BFh

8000_0000h   EAX         8000_0005h    8000_0005h    8000_0005h    8000_0006h
             EBX         Reserved      Reserved      Reserved      Reserved
             ECX         Reserved      Reserved      Reserved      Reserved
             EDX         Reserved      Reserved      Reserved      Reserved

8000_0001h   EAX         0000_066Xh    0000_067Xh    0000_068Xh    0000_069Xh
             EBX         Reserved      Reserved      Reserved      Reserved
             ECX         Reserved      Reserved      Reserved      Reserved
             EDX         0080_01BFh    0080_09BFh*   8080_09BFh    8080_29BFh

8000_0002h   EAX         2D44_4D41h    2D44_4D41h    2D44_4D41h    2D44_4D41h
             EBX         6D74_364Bh    6D74_364Bh    7428_364Bh    7428_364Bh
             ECX         202F_7720h    202F_7720h    3320_296Dh    3320_296Dh
             EDX         746C_756Dh    746C_756Dh    7270_2044h    5020_2B44h

Note:

* A value of 0080_01BFh is returned by the AMD-K6 processor Model 7 with an A stepping (a 0000_0570h value is returned in EAX 
by CPUID Function 1).
Table 102. Values Returned By AMD Processors (continued)

                         AMD-K6        AMD-K6        AMD-K6 3D     AMD-K6 3D+
                         Processor     Processor     Processor     Processor
Function     Register    (Model 6)     (Model 7)     (Model 8)     (Model 9)
-----------  ----------  ------------  ------------  ------------  ------------
8000_0003h   EAX         6465_6D69h    6465_6D69h    7365_636Fh    6563_6F72h
             EBX         6520_6169h    6520_6169h    0072_6F73h    726F_7373h
             ECX         6E65_7478h    6E65_7478h    0000_0000h    0000_0000h
             EDX         6E6F_6973h    6E6F_6973h    0000_0000h    0000_0000h

8000_0004h   EAX         0000_0073h    0000_0073h    0000_0000h    0000_0000h
             EBX         0000_0000h    0000_0000h    0000_0000h    0000_0000h
             ECX         0000_0000h    0000_0000h    0000_0000h    0000_0000h
             EDX         0000_0000h    0000_0000h    0000_0000h    0000_0000h

8000_0005h   EAX         Reserved      Reserved      Reserved      Reserved
             EBX         0280_0140h    0280_0140h    0280_0140h    0280_0140h
             ECX         2002_0220h    2002_0220h    2002_0220h    2002_0220h
             EDX         2002_0220h    2002_0220h    2002_0220h    2002_0220h

8000_0006h   EAX         Undefined     Undefined     Undefined     Reserved
             EBX         Undefined     Undefined     Undefined     Reserved
             ECX         Undefined     Undefined     Undefined     0100_4220h
             EDX         Undefined     Undefined     Undefined     Reserved
Descriptions, Signal 137

Design, Thermal 331 

Designations, Pin 342 

Device Identification Register 276 

Diagram, Pin Description 340 

Diagrams, Timing 187-226 

DIR 276

Disabling, Cache 238

Displaying 

cache information 514 

the processor's name 514 

Dissipation, Power 307 

Divide 

optimized 15-bit precision 498 

optimized full 24-bit precision 498 

Divides, Pipelined Pair of 24-Bit Precision 498

Division 95, 497 

square root 95, 498 

DP[7:0] 155

DR3-DR0 286 

DR5-DR4 286 

DR6 287 

DR7 287 

Drive Strength, Selectable 310 

Driven 139 



EADS# 156

EFER 39, 41, 231

EFLAGS Register 33, 83, 506-507

Electrical Data 303

EMMS Instruction 85, 359, 361, 474, 482 

Environment, Software 23 

EWBE# 157, 292



Exception 144-145, 155, 158, 172 

222, 256, 267, 287-289, 358, 475, 511 

flags 28-29 

floating-point 158, 163, 254, 256 

handler 283 

machine check 39, 511-512, 517, 519 

Exceptions 

3D 89, 93-94, 256

and interrupts 51 

debug 288 

floating-point 254

handling floating-point 254

interrupts, and debug in SMM 267

MMX 256, 358

Execution Resources, 3D 89

Execution Unit 16, 82 

3D 18, 84, 255, 460, 482, 500 

branch 22 

floating-point 2, 27, 82, 253, 464, 482 

500, 502, 511-512, 517, 519 

integer X 18 

integer Y 18 

load 91, 462, 467, 469-471, 489, 491-492 

multimedia 2, 18, 84, 255, 460 

store 463, 467, 470, 490, 493 

terminology 458 

see also Unit 

Execution Units 1, 7, 18, 456, 465-466 

and Dependency Latencies 458 

register 460 

Extended Functions 83, 506-507, 513, 515, 518

External 

address strobe 156 

write buffer empty 157 

EXTEST Instruction 277 



Feature Detection 510 

3D 83 

MMX 355 

FEMMS Instruction 82, 85, 87, 91-92, 96, 474, 482 

FERR# 158, 254, 256 

Fetch Unit 12 

Fetch, Instruction 12 

Float Conditions 181, 185

Floated 139 

Floating-Point 

and MMX/3D instruction compatibility 256 

and multimedia execution units 253 

code sample 502 

error 158 

handling exceptions 254 

register data types 30 

registers 27 

unit 253, 464, 511-512, 517, 519 

FLUSH# 159, 227, 248, 251, 270, 292

Frequency 296, 314-315, 326

multiplier 152

operating 147, 152, 228

FRSTOR 356 

FSAVE 356 

Function 0 83, 507-508, 516 
Function 1 83, 509-510, 516

Function 8000_0000h 507, 511, 513, 518 

largest standard function 518 

Function 8000_0001h 510, 513, 518

processor signature 518

Function 8000_0005h 514, 520-521

cache information 520

Functions 8000_0002h, 8000_0003h, and

8000_0004h 514, 520

processor name string 520



Gate Descriptor 48, 51 

General-Purpose Registers 23 

Grounding, Power and 299, 339 



H 

Halt State 292 

Handling Floating-Point Exceptions 254

Heat Dissipation Path 334 

HIGHZ Instruction 278 

History Table, Branch 21

Hit to 

modified line 160 

modified line, AHOLD-initiated inquire 210

modified line, HOLD-initiated inquire 205

shared or exclusive line, AHOLD-initiated inquire 209

shared or exclusive line, HOLD-initiated inquire 203

HIT# 160

HITM# 160, 310 

HLDA 161 

HOLD 162 

-initiated inquire hit to modified line 205

-initiated inquire hit to shared or exclusive line 203 

Hold 

acknowledge 161, 202-203 

acknowledge cycle 202 

timing 313, 328 



I 

I/O

buffer AC and DC characteristics 312 

buffer characteristics 309 

buffer model 310 

misaligned read and write 200 

model application note 312 

read and write 199 

trap dword 265 

trap restart slot 266 

IBIS 310 

IDCODE instruction 278 

Identifying 

supported features 510 

the processor's vendor 508

IEEE 85, 88

IEEE 1149.1 1, 271

IEEE 754 1, 27, 253 

IEEE 854 253 

IGNNE# 163, 254, 256

Ignore Numeric Exception 163 



INIT 164, 292

-initiated transition from protected mode to

real mode 225 

state of processor after 232 

Initialization 164 

power-on configuration and 227 

Input Setup and Hold Timings 

for 100-MHz bus operation 318 

for 66-MHz bus operation 322 

Inquire 204, 206, 208, 291

bus arbitration cycles 201 

cycle hit 160 

cycles 140-145, 156, 160-161, 177, 183

198, 201, 203, 205, 207, 209-210

212-213, 216, 247-251, 282, 291-293, 295

miss, AHOLD-initiated 207 

Instruction 

decode 13, 456

fetch 12 

formats, 3D 87 

formats, MMX 354 

pointer 27 

prefetch 10 

Instructions 52-81 

3D 79, 82-84, 87-135, 255, 512, 519

CPUID 83, 366, 369, 505-508, 510-511 

514-516, 518, 523 

EMMS 15, 85, 359, 361, 474, 482 

FEMMS 15, 82, 87, 96, 474, 482 

FERR# 254, 256 

FLUSH* 251 

FRSTOR 356 

FSAVE 356 

IGNNE# 254, 256 

INVD 248

MMX 75, 87, 255, 353, 360-453

MOVD 352, 362 

MOVQ 352, 363 

PACKSSDW 364

PACKUSWB 369 

PADDB 372 

PADDD 374 

PADDSB 376 

PADDSW 378 

PADDUSB 380 

PADDUSW 382 

PADDW 384 

PAND 386 

PANDN 388 

PAVGUSB 97 

PCMPEQB 390 

PCMPEQD 392 

PCMPEQW 394 

PCMPGTB 396 

PCMPGTD 398 

PCMPGTW 400 

PF2ID 99

PFACC 101

PFADD 103

PFCMPEQ 105 

PFCMPGE 107 

PFCMPGT 109 

PFMAX 111

PFMIN 113

PFMUL 87, 115

PFRCP 117

PFRCPIT1 119

PFRCPIT2 121

PFRSQIT1 123

PFRSQRT 125 

PFSUB 127 

PFSUBR 129 

PI2FD 131 

PMADDWD 349, 402 

PMULHRW 132 

PMULHW 404 

PMULLW 406 

POPFD 507, 512 

POR 408 

PREFETCH 11, 87, 134, 482, 512 

PREFETCHW 134 

PSLLD 410 

PSLLQ 412 

PSLLW 414 

PSRAD 416 

PSRAW 418 

PSRLD 420 

PSRLQ 422 

PSRLW 424 

PSUBB 426 

PSUBD 428 

PSUBSB 430

PSUBSW 432 

PSUBUSB 434 

PSUBUSW 436 

PSUBW 438 

PUNPCKHBW 440 

PUNPCKHDQ 442 

PUNPCKHWD 444 

PUNPCKLBW 446 

PUNPCKLDQ 448 

PUNPCKLWD 450 

PUSHFD 507 

PXOR 452 

supported by the AMD-K6 3D processor 52

SYSCALL 519 

SYSRET 519 

TAP 277 

WBINVD 248, 251 

Integer 

data types 25, 86

X execution unit 18 

x86 coding optimizations 478 

Y execution unit 18 

Internal 

architecture 5-22

snooping 247 

Interrupt 165, 176, 218, 222-223, 225 

232, 254, 256, 258, 267, 288, 294 

acknowledge 141, 149, 153, 165, 167, 172, 215, 218 

acknowledge cycles 141, 144, 146, 153, 170, 182 

descriptor table register 42-43 

flag 33, 165, 176

gate 50 

redirection bitmap 44 

request 165 

service routine 165, 170, 254, 257 

system management 257 

type of 51 



Interrupts 

01h 289

03h 289

10h 254

exceptions and 51 

INTR 165 

IRQ13 255 

NMI 170 

INTR 165, 292 

INV 165 

Invalidation Request 165 

INVD Instruction 248


K

KEN# 166 

Key Functionality 

3D 82 

MMX 348 


L

L1 Cache 1, 40, 92, 134, 198, 233

240, 244, 247, 251, 269-270, 282 

473, 491, 493, 520-521 

inhibit 282 

L2 Cache xxv, 282, 496, 522 

Latencies 

and throughput 465 

execution units and dependency 458 

Level-One Cache. See L1 Cache

Limit, Write Allocate 242

Line Fills, Cache- 239

Load Unit 91, 462, 467, 469-471, 489, 491-492 

LOCK# 167 

Locked 

cycles 215 

operation with BOFF# intervention 216 

operation, basic 215 

Logic 

branch 9 

branch prediction 456, 464, 473, 477 

branch-prediction 20-21, 456, 464, 473, 477 

external support of floating-point exceptions 254

M

M/IO# 168

Machine Check Exception 39, 511-512, 517, 519 

Maskable Interrupt 165 

Matrix Multiplication Optimization Example 487 

MCAR 39, 231 

MCTR 39-40, 231 

Memory 

or I/O 168 

read and write, misaligned single-transfer 194 

read and write, single-transfer 192

reads and writes 192

MESI 1, 10, 201, 205, 234, 246, 249, 251

bit 11, 235-236, 495

states in the data cache 235

Microarchitecture 2, 82, 455-457

enhanced RISC86 6

overview, AMD-K6 3D processor 5 
Misaligned

I/O read and write 200

single-transfer memory read and write 194

Mixing MMX and Floating-Point Instructions 358

MMX xxvi, 8, 15-16, 23, 52, 82, 84, 87, 89-91

93-94, 174, 227, 505, 510-511, 517, 519 

data types 31, 352 

exceptions 256, 358 

feature detection 355 

instruction compatibility, floating-point and 256 

instruction formats 354 

instruction set 360 

instructions 82, 84, 87, 89, 91-94, 353, 360-453 

505, 510-512, 517, 519

key functionality 348

multimedia technology 347-453

multimedia technology architecture 348

prefixes 359 

programming considerations 355 

register set 350 

registers 31, 84, 350

Mode, Tri-State Test 270 

Model-Specific Registers (MSR) 39 

ModR/M 

address mode 483, 496 

byte 52-53, 71, 75, 79, 134, 482, 497 

instruction 483 

instruction format 87 

MOVD Instruction 352, 362 

MOVQ Instruction 352, 363 

MPEG Decoding 82 

MSR 39 

Multimedia 

coding optimizations 482 

execution unit 18, 255

technology, MMX 347-453 

Multiplication 

optimization example 487 


N

NA# 169

Negated 139 

Next Address 169 

NMI 170, 292

No-Connect Pins 175, 301 

Non-Maskable Interrupt 170 

Non-Pipelined 193, 239 


O

Operands 87-89, 458 

Operating Ranges 303 

Operation, Cache 235 

Optimization 

code 455 

coding guidelines 472 

example, 3D matrix multiplication 487 

techniques, general x86 472 

Optimizations

general AMD-K6 processor x86 coding 474 

integer x86 coding 478 

multimedia coding 482 



Optimized 

15-bit precision divide 498 

15-bit precision square root 499 

24-bit precision square root 499

full 24-bit precision divide 498 

Organization, Cache 233 

Output 

delay timings for 100-MHz bus operation 316 

delay timings for 66-MHz bus operation 320 

signals 229 

P 

Package 

specifications 343 

thermal specifications 331 

PACKSSDW Instruction 364 

PACKSSWB Instruction 366

PACKUSWB Instruction 369 

PADDB Instruction 372 

PADDD Instruction 374 

PADDSB Instruction 376 

PADDSW Instruction 378 

PADDUSB Instruction 380

PADDUSW Instruction 382

PADDW Instruction 384 

Page 

cache disable 171 

directory entry (PDE) 47, 236 

table entry (PTE) 47-48, 236

writethrough 173

Paging 45, 511-512, 517, 519 

PAND Instruction 386 

PANDN Instruction 388 

Parity 138, 144, 146, 155, 172, 192 

bit 144, 155, 172 

check 144-145, 155, 172 

error 145, 172, 207, 273

flags 33 

PAVGUSB Instruction 92, 97 

PCD 171, 236, 244 

PCHK# 172 

PCMPEQB Instruction 390

PCMPEQD Instruction 392 

PCMPEQW Instruction 394 

PCMPGTB Instruction 396 

PCMPGTD Instruction 398 

PCMPGTW Instruction 400 

PF2ID Instruction 88, 90, 92, 99 

PFACC Instruction 90, 92, 101 

PFADD Instruction 90, 92, 103 

PFCMPEQ Instruction 92, 105 

PFCMPGE Instruction 92, 107 

PFCMPGT Instruction 92, 109 

PFMAX Instruction 90, 92, 111 

PFMIN Instruction 90, 92, 113 

PFMUL Instruction 87, 90, 92, 115 

PFRCP Instruction 90, 92, 117 

PFRCPIT1 Instruction 90, 92, 119

PFRCPIT2 Instruction 90, 92, 121

PFRSQIT1 Instruction 90, 92, 123

PFRSQRT Instruction 90, 92, 125 

PFSUB Instruction 90, 92, 127 
PFSUBR Instruction 90, 92, 129

PI2FD Instruction 88, 90, 92, 131

Pin 

connection requirements 301 

description diagram 340

designations 342 

Pipeline 21, 82, 91, 191, 196 

457-460, 462-463, 466

control 20 

register X and Y 20 

six-stage 6, 8, 459 

Pipelined 10, 19, 89, 169, 191, 195, 197 

212, 233, 245, 468, 471, 482, 498, 500, 502

burst reads 195 

cycles 10, 142, 154 

design 18 

pair of 24-bit precision divides 498 

PMADDWD Instruction 349, 402 

PMULHRW Instruction 92, 132 

PMULHW Instruction 404

PMULLW Instruction 406

Pointer, Instruction 27

POPFD Instruction 507 

POR Instruction 408 

Power 

and grounding 299, 339 

connections 299 

dissipation 307 

Power-on Configuration and Initialization 227 

Precision Divide, Optimized Full 24-Bit 498

Precision Divides, Pipelined Pair of 24-Bit 498 

Precision Square Root 

optimized 15-bit 499 

optimized 24-bit 499 

Predecode Bits 10-11, 235 

Prediction Logic, Branch 456, 464, 473, 477 

Preemptive Multitasking 357 

PREFETCH Instruction 11, 87, 134, 482 

PREFETCH/PREFETCHW Instructions 134 

Prefetching 10, 82, 245 

PREFETCHW Instruction 134

Prefixes 

3D 94 

MMX 359 

Processors, The AMD-K6 Family of 456 

Programming 

considerations, MMX 355

steps 493

PSLLD Instruction 410

PSLLQ Instruction 412

PSLLW Instruction 414 

PSRAD Instruction 416 

PSRAW Instruction 418 

PSRLD Instruction 420 

PSRLQ Instruction 422 

PSRLW Instruction 424 

PSUBB Instruction 426

PSUBD Instruction 428 

PSUBSB Instruction 430 

PSUBSW Instruction 432 

PSUBUSB Instruction 434 

PSUBUSW Instruction 436 

PSUBW Instruction 438 

PUNPCKHBW Instruction 440 



PUNPCKHDQ Instruction 442 

PUNPCKHWD Instruction 444

PUNPCKLBW Instruction 446 

PUNPCKLDQ Instruction 448 

PUNPCKLWD Instruction 450 

PUSHFD Instruction 507 

PWT Instruction 173 

PXOR Instruction 452 

R 

Ranges, Operating 303 

Ratings, Absolute 304 

Read and Write 

basic I/O 199

misaligned I/O 200 

Reads, Burst Reads and Pipelined Burst 195 

Reciprocal Square Root, Square Root and 498 

Register 

boundary scan 273

bypass (BR) 277 

control 34 

data types, floating-point 30, 85, 89, 506 

debug 36, 283 

EAX 513, 515-516, 518 

execution units 460 

floating-point 27 

general-purpose 23

SYSCALL/SYSRET target address (STAR) 41

Register Set 

3D 84 

MMX 350 

Register X 

and Y Execution 461 

and Y Functional Units 20 

and Y Pipelines 20 

Execution Pipeline 91, 458 

Unit 89-90, 466 

Register Y 

Execution Pipeline 91, 458 

Unit 89-90, 467 

Registers 8, 23, 229, 256 

3D 23, 31, 84, 88, 91, 93 

descriptors and gates 48 

device identification (DIR) 276 

DR3-DR0 286 

DR5-DR4 286 

DR6 287 

DR7 287 

EFLAGS 33 

extended feature enable register (EFER) 41 

IR 273 

MCAR 39 

memory management 42 

MMX 23, 31, 84, 350 

segment 26 

STAR 41 

TAP 273 

TR12 40 

WHCR 42 

Regulator, Voltage 335 

Replacement, Cache-Line 240, 248 

Requirements, Pin Connection 301

Reserved 175

RESET 174, 228, 292

and test signal timing 324

signals sampled during 227

state of processor after 229

Resource Constraints 466 

Return Address Stack 22 

Revision Identifier, SMM 262

RISC86 Microarchitecture 6 

RSM Instruction 263, 266-267 

RSVD 175 

S

Sample Code 514

SAMPLE/PRELOAD Instruction 278 

Sampled 139 

Scheduler 

centralized 16 

instruction control unit 8 

SCYC 175 

Sector, Write to a 241 

Segment 

descriptor 26, 48-50 

registers 26 

task state 44 

usage 26 

Selectable Drive Strength 310 

Shift-DR state 280

Shift-IR state 280

Shutdown Cycle 222 

Signal 

descriptions 137 

switching characteristics 313 

terminology 139 

timing, RESET and test 324

Signals

A[20:3] 310

A[31:3] 141

A20M# 140, 258

ADS# 142, 310

ADSC# 142 

AHOLD 143, 292 

AP 144 

APCHK# 145 

BE[7:0]# 146

BF[2:0] 147, 296

BOFF# 148, 213 

BRDY# 149 

BRDYC# 150, 310

BREQ 151 

CACHE# 152, 237 

cache-related 238 

CLK 152

D/C# 153

D[63:0] 154

DP[7:0] 155

EADS# 156 

EWBE# 157, 292 

FERR# 158, 256

FLUSH# 159, 227, 248, 270, 292

HIT# 160

HITM# 160, 310

HLDA 161 

HOLD 162

IGNNE# 163, 256 



INIT 164, 292

INTR 165, 292

INV 165

KEN# 166

LOCK# 167

M/IO# 168

NA# 169 

NMI 170, 292 

output 229 

PCD 171 

PCHK# 172

PWT 173 

RESET 174, 292 

RSVD 175 

sampled during RESET 227 

SCYC 175 

SMI# 176, 257, 292

SMIACT# 177, 257 

STPCLK# 178, 293 

TAP 272 

TCK 179 

TDI 179 

TDO 179 

TMS 180 

TRST# 180 

VCC2DET 181 

VCC2H/L# 181 

W/R# 182, 310

WB/WT# 183

SIMD 9, 82, 88, 90, 348-349, 457

Single Instruction Multiple Data 

(SIMD) 9, 82, 348-349, 457 

Single-Transfer Memory Read and Write 192 

Six-Stage Pipeline 459 

SMI# 176, 257, 292

SMIACT# 177, 257

SMM 257 

base address 263 

default register values 258 

halt restart slot 264 

I/O trap DWORD 265

I/O trap restart slot 266 

operating mode 258 

revision identifier 262 

state-save area 260 

Snoop 177, 183, 197, 248-250

Snooping 

cache 250 

internal 247 

Socket 7 xxv, 1

Software Environment 23 

Special 

bus cycle 149, 178, 220-223, 264, 293 

cycle 157, 159, 178, 186, 198 

220, 222-223, 238, 292-293 

Specifications 

package 343 

package thermal 331 

Split Cycle 175 

Square Root 95 

and reciprocal square root 498 

optimized 15-bit precision 499 

optimized 24-bit precision 499 

square root and reciprocal 498 

Stack, Return Address 22

Standard Functions 506-507, 515-516

State Machine Diagram, Bus 189

State of Processor

after INIT 232

after RESET 229 

States, Cache 246 

State-Save Area, SMM 260 

Stop 

clock 178

clock state 223, 295-296

grant inquire state 291-293, 295

grant state 223, 293, 295

Store Unit 463, 467, 470, 490, 493

STPCLK# 178, 293

Super7 xxv, 1, 3-4

platform initiative 3 

Switching Characteristics 314 

100-MHz bus operation 314

66-MHz bus operation 315

input setup and hold timings for 100-MHz bus 318 

input setup and hold timings for 66-MHz bus 322 

output delay timings for 100-MHz bus 316 

output delay timings for 66-MHz bus 320 

signal 313 

valid delay, float, setup, and hold timings 316 

SYSCALL 39, 42, 69, 231, 512, 519 

SYSCALL/SYSRET Target Address Register

(STAR) 41-42 

SYSRET 69, 512, 519 

System 

design, airflow management in a 336 

management interrupt (SMI#) 176 

management interrupt active (SMIACT#) 177 

management mode (SMM) 257 


T

Table, Branch History 21 

TAP 271 

TAP Controller States 

capture-DR 280 

capture-IR 280 

shift-DR 280 

shift-IR 280

state machine 278 

test-logic-reset 280 

update-DR 280 

update-IR 280

TAP Instructions 277 

BYPASS 278 

EXTEST 277 

HIGHZ 278 

IDCODE 278 

SAMPLE/PRELOAD 278

TAP Registers 273 

instruction register (IR) 273 

TAP Signals 272 

Target Cache, Branch 21 

Task 

state segment 44 

switching 82, 84, 93, 356 

TCK 179 

TDI 179

TDO 179 



Temperature 303, 331-332, 334 

case 334 

Terminology, Signals 139 

Test 

access port, boundary-scan 271 

and debug 269 

clock 179 

data input 179 

data output 179 

-logic-reset state 280 

mode select 180 

mode, tri-state 270 

register 12 (TR12) 40 

reset 180 

Testing for 

extended functions 513 

the CPUID instruction 506

Thermal 307, 332-336 

design 331 

heat dissipation path 334 

layout and airflow consideration 335

measuring case temperature 334

package specifications 331

Throughput, Latencies and 465 

Time Stamp Counter 40, 511-512, 517, 519 

Timing Diagrams 187-226 

test signal 330 

TLB 7, 171, 234, 239, 270, 475, 514, 520-521 

TMS 180 

TR12 39-40, 231, 236-237, 243, 282 

Transition from Protected Mode to Real Mode, 

INIT-Initiated 225

Translation Lookaside Buffer (TLB) 45, 233 

Trap Dword, I/O 265 

Tri-State Test Mode 270 

TRST# 180 

TSC 39-40, 231, 292-293 

TSS 44, 50-51, 261, 287 

U

Unit 

3D 482, 500 

branch condition 464 

fetch 12 

floating-point 482, 500, 502, 511-512 

load 91, 462, 467, 469-471, 489, 491-492 

scheduler/instruction control 8 

store 463, 467, 470, 490, 493 

see also Execution Unit 
Units 

register execution 460 

register X and Y 20 


V

Values Returned by the CPUID Instruction 523 

VCC2DET 181 

VCC2H/L# 181 

Voltage 181, 188, 299, 303, 305, 310, 314 

ranges 310 

regulator 335 
W

W/R# 182, 310

WAE15M 242 

WAELIM 242 

WB/WT# 183

WBINVD Instruction 248, 251 

WCDE 42, 242, 244 

WHCR 39, 42, 231, 242, 244 

Write 

handling control register (WHCR) 42 

to a cacheable page 241 

to a sector 241 

Write Allocate 236, 240-242, 244, 246

enable 42, 242 

enable limit 42, 242 

limit 242 

logic mechanisms and conditions 243-244 

Write/Read 182 



Writeback 152, 154-155, 166, 173, 177 

183, 186, 197-198, 220, 233, 239 

246, 249, 251, 297

burst 197

cache 6, 10

cycles 140, 142-143, 157, 160, 183, 198

205, 209-210, 212, 214, 216

236-237, 282, 295

or writethrough 183 



Writethrough vs. Writeback Coherency States 251

X

x86

coding optimizations, general AMD-K6 processor 474

coding optimizations, integer 478

optimization techniques, general 472
PC catalog

Order Toll Free 1-800-451-4319

Books and Software 



Abacus

www.abacuspub.com 





To order direct call Toll Free 1-800-451-4319 



In US and Canada add $5.00 shipping and handling. Foreign orders add $13.00 per item.
Michigan residents add 6% sales tax. 
Productivity Series books are for users who want
to become more productive with their PC.



Upgrading and Maintaining Your PC 

New Sixth Edition! 

Buying a personal computer is a major investment. 
Today's fast-changing technology and innovations,
such as Windows NT, ISDN cards and super fast
components, require that you upgrade to keep your
system current. New hardware and software
developments require more speed, more memory
and larger capacities. This book is for the millions
of PC owners who want to retain their sizable
investment in their PC system by upgrading.

With current info on the newest technology,
Upgrading & Maintaining Your PC starts by
helping readers make informed purchasing
decisions. Whether it's a larger hard drive, more
memory or a new CD-ROM drive, you'll be able to 
buy components with confidence. 

Inside this new 6th Edition: 

• Over 200 Photos and Illustrations 

• Upgrader's guide to shopping for PC motherboards, operating
systems, I/O cards, processors and more!

• Windows NT Workstation 4.0, Windows 95 and OS/2 Warp 4.0

• Processors (Intel, Cyrix, AMD and more), Internal/External cache

• The latest video and sound cards and installation tips

• SPECIAL WINDOWS 95 SECTION!

On the CD-ROM- 

• Wintune: Windows Magazine's system tune-up program
• SYSINFO: system "quick glance" program
• Cyrix Test: Cyrix upgrade processor test
• P90 TEST: the famous Intel Pentium "math" test
• WinSleuth: Windows diagnostic utility
• And Much More!




Publisher: Abacus Suggested Retail Price 

Order Item #S325 $44.93 US/$59.95 CAN 

ISBN: 1-55755-329-7 CD-ROM Software Included 



Developers Series books are for professional software developers who
require in-depth technical information and programming techniques.



PC INTERN, 6th Edition
The Encyclopedia of System Programming

Now in its 6th Edition, more than 500,000 
programmers worldwide rely on the 
authoritative and eminently understandable 
information in this one-of-a-kind volume. 
You'll find hundreds of practical, working 
examples written in assembly language, C++, 
Pascal and Visual Basic — all professional 
programming techniques which you can use 
in your own programs. PC INTERN is a 
literal encyclopedia for the PC programmer.
PC INTERN clearly describes the aspects of 
programming under all versions of DOS as 
well as interfacing with Windows. 

Some of the topics include: 

• Memory organization on the PC 

• Writing resident TSR programs 

• Programming graphic and video cards 

• Using extended and expanded memory 

• Handling interrupts in different languages 

• Networking programming NetBIOS and IPX/SPX 

• Win95 virtual memory and common controls 

• IRQs: programming interrupt controllers

• Understanding DOS structures and function

• Using protected mode, DOS extenders and DPMI/VCPI multiplexer

• Programming techniques for CD-ROM drives 

• Programming Sound Blaster and compatibles 

The companion CD-ROM transforms PC INTERN from a book into an interactive 
reference. You can search, navigate and view the entire text of the book and have 
instant access to information using hypertext links. Also included on the CD-ROM 
are hundreds of pages of additional programming tables and hard-to-find material.

Author: Michael Tischer and Bruno Jennrich 

Order Item: #B304 SRP: $69.95 US/$99.95 CAN

ISBN: 1-55755-304-1 with companion CD-ROM



Productivity Series
become more productive with your PC 



Nets and Intranets With Win95

Getting Connected

Windows 95 has a surprisingly rich set of
networking capabilities. Built-in networking 
delivers an affordable and easy way to connect with 
others and benefit by sharing resources — files, 
printers, and peripherals. Network sharing saves 
you and your organization time and money and 
adds convenience. 

Another great benefit of Windows 95 Networking 
is its ability to let you run an Intranet. This book 
and companion CD-ROM has all the pieces that 
you'll need to set up your own internal World Wide 
Web server (Intranet) without the expense of using 
an outside Internet Service Provider. 

• A practical hands-on guide for setting up a small 
network or Intranet using Win95 or Windows for 
Workgroups 3.11.

• Take advantage of Windows 95's built-in options so you
can immediately use its networking features:

Shared printers 
Easy-to-use groupware 
E-mail and faxes 
Additional hard drive capacity 
Centralized backups 
TCP/IP 

• Step-by-step guide to getting and staying connected 
whether you're in a small office, part of a workgroup, or
connecting from home.

• Perfect for the company wanting to get connected and
share information with employees inexpensively 

Author: H.D.Radke 
Item #: B311
ISBN: 1-55755-311-4
SRP: $39.95 US/$54.95 CAN
with CD-ROM 




CD-ROM 
Included 



Productivity Series books are for users who want
to become more productive with their PC.



McAfee Anti-Virus for Beginners



In 1986 there was one virus; today there are
7400. McAfee Anti-Virus for Beginners offers an
introduction to protecting and securing your
system from viruses.
The first step to creating a safe system is
understanding what viruses are, how they replicate,
how they're contracted, how they hide from
detection, what 'triggers' are, and what the
'payload' is (what happens when a virus is
activated). You'll learn how to recognize the
symptoms of a virus, then learn to use two of
McAfee's most popular products: VirusScan and
WebScan.

This book will show you how to install and
configure fully-functional evaluation versions of
both VirusScan and WebScan from the CD-ROM.
You'll learn how to use VirusScan to detect and
remove viruses that may already be present, and
how to protect your system in the future.
Topics discussed include: 

• What viruses are 

• How anti-virus software works 

• Installing & configuring anti-virus software 

• Detecting viruses and cleaning your system 

• Internet virus concerns 

• The importance of regular anti-virus software updates 




On the CD-ROM- 

The companion CD-ROM contains fully functional evaluation 
versions of McAfee's most popular anti-virus programs: VirusScan
and WebScan. Use these programs to safeguard your PC on and off 
the Internet. 



Author: Brian Howard 
Order Item #B318
ISBN: 1-55755-318-1 



Suggested Retail Price 
$19.95 US/$26.95 CAN
CD-ROM Included 









About This Book and CD-ROM 



The Book 

At the time of publication, AMD had not made final naming decisions for the 
processor and the 3D technology. The names used in this book are the AMD code 
names for the processor and the 3D technology. 

Refer to Appendix B, "Code Optimization" on page 455 for details regarding the 
examples shown in the AMD-K6 3D simulator, especially the tables beginning with 
Table 87 on page 468. 

Refer to the AMD web site at www.amd.com for updates to material related to this
book, including new scripts and updates for the AMD-K6 3D simulator. 

CD-ROM Contents 

The CD-ROM included with this book contains the following: 

■ AMD-K6 3D processor simulator 

■ All AMD processor technical documentation in Adobe Acrobat PDF format 

■ Adobe Acrobat Reader for most platforms 

Minimum System Requirements 

The following minimum system is required in order to run the AMD-K6 3D processor 
simulator. 

■ A 133-MHz AMD-K6 processor 

■ 16 Mbytes of memory 

■ Windows® 95 or Windows NT™ 4.0 operating system

■ 256-color SVGA graphics mode video 

■ Video resolution of 800 by 600 

We recommend the following system for maximum enjoyment when using the 
simulator: 

■ A 166-MHz AMD-K6 processor or better 

■ 32 Mbytes of memory 

■ Windows 95 or Windows NT 4.0 operating system 

■ 65,536-color graphics mode or better

■ Video resolution of 1024 by 768

■ A sound card with speakers

Installation of the CD-ROM 

A setup program is provided for your convenience. Run install.exe from the root 
directory of the CD-ROM, and the installation program will step you through installing 
the simulator and the Acrobat reader, if you need it. 



Special versions of the Adobe Acrobat Reader are 
included on the CD-ROM for most platforms. 
These special versions of the reader include the 
search plug-in so you may utilize the global search 
functions. 



The AMD-K6 3D Processor Simulator 

The AMD-K6 3D processor simulator is a 
multimedia application that enables you to see a 
conceptual view of x86 code as it flows through 
the internal functional units of the processor. The 
simulator includes a tutorial/help system. You may 
select from several scenarios that animate code 
flowing through the processor. 
The simulator also has a script authoring mode that 
allows you to write your own scripts. Using the 
scripting language is explained in detail in the 
Scripting Users Guide. The users guide is accessed 
by opening the file named amdk6script.pdf found 
on the CD-ROM in the \AMDK6SIM directory. 

This book describes the new
AMD-K6-2 Processor! 

Revolutionary Multimedia Performance 




The AMD-K6 3D Processor: 

Learn how it boosts your PC video and 
multimedia performance by 50%. 
Learn powerful solutions for creating a 
more entertaining and productive PC 
platform. 

Break through existing bottlenecks with 
multimedia and floating-point-intensive 
applications. 

"AMD listened to their customers,
and they have implemented a real
improvement to the x86 instruction
set. You will be able to see this
improvement in the performance of
the AMD-K6 3D processor. The
improvement is not trivial."
John C. Dvorak



The AMD-K6 3D Processor is written for PC users 
interested in the latest advance in the computer industry. 
The 3D graphic capability of the AMD-K6 3D is truly
revolutionary, delivering on the promises made but never
realized by MMX™: dramatic and real improvements in
multimedia performance at the desktop. 

With AMD-K6 3D technology, new, more powerful 
hardware and software applications enable a more 
entertaining and productive PC platform. Improvements 
include faster frame rates on high-resolution scenes, 
superior modeling of real-world environments and 
physics, sharper and more detailed 3D imaging, smoother 
video playback and near-theater-quality audio.

The AMD-K6 3D Processor describes the operation of 
the AMD-K6 3D CPU. In addition to the 3D graphic 
capability, you'll learn the internal architecture of the 
AMD-K6-2 processor with hundreds of illustrations, charts 
and tables.

The AMD-K6 3D Processor includes:

• The AMD-K6 3D instruction set definitions
• Examples of how the AMD-K6 3D signals interact with
external devices
• Definitions of the complete MMX instruction set
• Examples showing the internal operation of the
processor



The CD-ROM features an interactive simulation
showing the operation of the AMD-K6 3D
processor as it executes instructions. See for
yourself how AMD is making your PC
SCREAM!!
You can count on us




5370 52nd Street SE Grand Rapids, MI 49512
www.abacuspub.com



$34.95 U.S. 
$46.95 CAN 



Level: Intermediate-Advanced 